25.2. Backup and Recovery in a Distributed System
This section describes the basics of backup and recovery in Postgres Pro Shardman.

You can use the probackup backup command of the shardmanctl tool to perform a full, binary-consistent backup of a Postgres Pro Shardman cluster to a backup repository on the local host or in S3-compatible object storage, and the probackup restore command to perform a recovery from any backup in the repository. Full and incremental (DELTA) backups are supported.
The Postgres Pro Shardman pg_probackup utility for creating consistent full and incremental backups is integrated into shardman-utils. shardman-utils uses the pg_probackup approach of storing backups in a pre-created repository. In addition, the pg_probackup commands archive-get and archive-push are used to deliver WAL files to the backup repository. Backup and restore modes use a passwordless SSH connection between the cluster nodes and the backup node.
The Postgres Pro Shardman cluster configuration parameter enable_csn_snapshot must be set to on. This parameter is required for the cluster backup to be consistent; if it is disabled, a consistent backup is not possible.
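Before taking a backup, you can verify this parameter on any cluster node, for example with psql (the host name below is hypothetical; adjust it to your environment):

```shell
# Check that CSN-based snapshots are enabled; shard-node-1 is a hypothetical host.
# enable_csn_snapshot is a regular server configuration parameter, so SHOW works.
psql -h shard-node-1 -U postgres -c "SHOW enable_csn_snapshot;"
```

The command should report on; otherwise, enable the parameter before attempting a consistent backup.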
For consistent visibility of distributed transactions, Postgres Pro Shardman uses global snapshots based on physical clocks. A consistent snapshot for backups can be obtained in the same way, except that the time corresponding to the global snapshot must be mapped to a set of LSNs, one per node. Such a set of consistent LSNs in a cluster is called a syncpoint. By creating a syncpoint and taking from it the LSN of each node in the cluster, we can make a backup of each node that necessarily contains that LSN. We can later recover to this LSN using the point-in-time recovery (PITR) mechanism.
The probackup command uses the pg_probackup utility and its options to create a cluster backup. Whenever probackup commands are used for restore, the node names, defined by hostname or IP address, must match those that were in place at the time of the backup.
25.2.1. Requirements for Backup and Restore with pg_probackup
To back up and restore a Postgres Pro Shardman cluster via the probackup command, the following requirements must be met:
- On the backup host, Postgres Pro Shardman utilities must be installed into /opt/pgpro/sdm-17/bin.
- On the backup host and on each cluster node, pg_probackup must be installed into /opt/pgpro/sdm-17/bin.
- On the backup host, the postgres Linux user and group must be created.
- A passwordless SSH connection between the backup host and each Postgres Pro Shardman cluster node must be configured for the postgres Linux user. To do this, on each node, the postgres user must create the .ssh subdirectory in the /var/lib/postgresql directory and place there the keys required for the passwordless SSH connection.
- To perform a backup/restore with a fairly large number of threads, such as 50 (-j=50, see the section called “backup” for details), MaxSessions and MaxStartups must be set to 100 for the backup host in the /etc/ssh/sshd_config file.

Note

Setting the number of threads (-j option) to a value greater than 10 for shardmanctl probackup may result in the actual number of SSH connections exceeding the maximum allowed number of simultaneous SSH connections on the backup host and consequently lead to an “ERROR: Agent error: kex_exchange_identification: Connection closed by remote host” error. To correct the error, either reduce the number of probackup threads or increase the value of the MaxStartups configuration parameter on the backup host. If SSH is set up as an xinetd service on the backup host, adjust the value of the xinetd per_source configuration parameter rather than MaxStartups.
You can disable SSH for data copying by setting the --storage-type option to the mount or S3 value (SSH will still be required to execute remote commands). This value will also be used automatically in the restore process.

- A backup folder, or a bucket in the S3-compatible object storage, must be created.
- Access to the backup folder must be granted for the postgres Linux user.
- The shardmanctl utility must be run as the postgres Linux user.
- The init subcommand for backup repository initialization must be successfully executed on the backup host.
- The archive-command add subcommand, which enables archive_command for each replication group to stream WALs into the initialized repository, must be successfully executed on the backup host.
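The last two requirements can be sketched as follows. The etcd endpoints and repository path are hypothetical, and the exact option set may differ in your installation:

```shell
# Run as the postgres Linux user on the backup host.
# Initialize the backup repository (hypothetical endpoints and path):
shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 \
    probackup init --backup-path /var/backups/shardman

# Enable archive_command so that every replication group streams its WALs
# into the initialized repository:
shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 \
    probackup archive-command add --backup-path /var/backups/shardman
```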
25.2.2. pg_probackup Backup Process
shardmanctl conducts a backup task in several steps. The tool:
1. Takes necessary locks in etcd to prevent concurrent cluster-wide operations.
2. Connects to a random replication group and locks Postgres Pro Shardman metadata tables to prevent modification of foreign servers during the backup.
3. Dumps Postgres Pro Shardman metadata, stored in etcd, to a JSON file in the backup directory or bucket in the S3-compatible object storage.
4. Concurrently runs pg_probackup for each replication group, using the configured archive_command, to take its backup.
5. Creates the syncpoint and gets the LSN of each replication group from the syncpoint data structure. Then uses the pg_probackup archive-push command to push the WAL files generated after the backup finished, including, for each replication group, the WAL file that contains the syncpoint LSN.
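The steps above are driven by a single command. A minimal invocation might look like this; the endpoints, repository path, backup mode, and thread count are all illustrative, and the option names are assumed to follow the pg_probackup conventions used elsewhere in this section:

```shell
# Take a full backup with 10 parallel threads (hypothetical values throughout).
shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 \
    probackup backup --backup-path /var/backups/shardman --backup-mode FULL -j 10

# A later incremental (DELTA) backup reuses the same repository:
shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 \
    probackup backup --backup-path /var/backups/shardman --backup-mode DELTA -j 10
```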
For more information on the backup command, see reference.
25.2.3. Cluster Restore from a Backup with pg_probackup
You can restore a backup on the same cluster or on a compatible one. Compatible clusters are those that use the same Postgres Pro Shardman version and have the same number of replication groups.
shardmanctl can perform full restore, metadata-only restore, or schema-only restore. Metadata-only restore is useful if issues are encountered with the etcd instance, but DBMS data is not corrupted.
During metadata-only restore, shardmanctl restores etcd data from the dump created during the backup.
Important
Restoring metadata to an incompatible cluster can lead to catastrophic consequences, including data loss, since the metadata state can differ from the actual configuration layout. Do not perform metadata-only restore if there were cluster reconfigurations after the backup, such as addition or deletion of nodes, even if the same nodes were added back again.
Schema-only restore recovers only the schema, without data. It can be useful when the data volume is large and the schema alone is needed for testing or verification.
During a full restore, shardmanctl checks whether the number of replication groups in the target cluster matches the number of replication groups in the backup. This means that you cannot restore on an empty cluster: you must first add as many replication groups as necessary for their total number to match that of the cluster from which the backup was taken.
You can also restore only a single shard using the --shard parameter.
In addition, you can perform point-in-time recovery using the --recovery-target-time parameter. In this case, Postgres Pro Shardman finds the syncpoint closest to the specified timestamp and suggests restoring to that LSN. You can also specify the --wal-limit option to limit the number of WAL segments to be processed.
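A restore with point-in-time recovery might then be sketched like this. The backup ID, timestamp, endpoints, and repository path are hypothetical; backup IDs can be listed with the show command:

```shell
# Restore the cluster to the syncpoint closest to the given timestamp,
# replaying at most 100 WAL segments while searching for it.
shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 \
    probackup restore --backup-path /var/backups/shardman --backup-id S6KDO7 \
    --recovery-target-time "2024-05-01 12:00:00" --wal-limit 100
```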
shardmanctl conducts a full restore in several steps. The tool:

1. Takes the necessary locks in etcd to prevent concurrent cluster-wide operations and tries to assign replication groups in the backup to existing replication groups. If it cannot do this (for example, due to cluster incompatibility), the recovery fails.
2. Restores part of the etcd metadata: the cluster specification and parts of replication group definitions.
3. When the correct metadata is in place, runs init in PITR initialization mode with RecoveryTargetName set to the value of the syncpoint LSN from the backup info file. DataRestoreCommand and RestoreCommand are also taken from the backup info file. These commands are generated automatically during the backup phase, so it is not recommended to edit the file containing the Postgres Pro Shardman cluster backup description. When restoring a cluster, for each replication group the WAL files containing the final LSN to restore are requested automatically from the backup repository on the remote backup node via the pg_probackup archive-get command.
4. Waits for each replication group to recover.
5. Finally, re-enables archive_command.
When performing a sequential restore in Postgres Pro Shardman, be cautious of potential timeline conflicts within WAL (write-ahead logging) segments. This issue commonly arises when restoring a database from a backup that was created at a certain point in time. If the database continues to operate and generate WAL segments after this backup, those new WAL segments are associated with a different timeline. During restoration, if the system tries to replay WAL segments from a different timeline, one that diverged from the point of the backup, it can lead to inconsistencies and conflicts. Additionally, after completing a restoration in Postgres Pro Shardman, it is strongly advised not to restore the database onto the same timeline, or onto any timeline that precedes the one from which the backup was made.
For more information, see reference.
25.2.4. Merging Backups with pg_probackup
The more incremental backups are created, the larger the total size of the backup catalog. To save disk space, you can merge incremental backups into their parent full backup by running the merge command with the backup ID of the most recent incremental backup specified:
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup merge --backup-path backup_dir --backup-id backup_id
This command merges the backups that belong to a common incremental backup chain. If a full backup is specified, it is merged with its first incremental backup. If an incremental backup is specified, it is merged into its parent full backup, along with all the incremental backups between them. Once the merge is complete, the full backup covers all the merged data, and the incremental backups are removed as redundant. Thus, the merge operation is virtually the same as removing all the outdated backups from a full backup, but is much faster, especially for large data volumes. It also saves I/O and network traffic when using pg_probackup in remote mode.
Before merging, pg_probackup validates all the affected backups to ensure that they are valid. The current backup status can be seen by running the show command:
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup show --backup-path backup_dir
For more information, see reference.
25.2.5. Deleting Backups with pg_probackup
To delete a backup that is no longer needed, run the following command:
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup delete --backup-path backup_dir --backup-id backup_id
This command deletes the backup with the specified backup_id, along with all the incremental backups that descend from this backup_id, if any. This allows you to delete some of the recent incremental backups without affecting the underlying full backup and other incremental backups that follow it.
To delete obsolete WAL files that are not needed for recovery, use the --delete-wal flag:
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup delete --backup-path backup_dir --backup-id backup_id --delete-wal
For more information, see reference.