2.6. Backup and Recovery
This section describes the basics of backup and recovery in Shardman.
You can use the backup command of the shardmanctl tool to perform a full binary consistent backup of a Shardman cluster to a shared directory or a local directory (if --use-ssh is specified) and the recover command to perform a recovery from this backup.
You can also use the probackup backup command of the shardmanctl tool to perform a full binary consistent backup of a Shardman cluster to the backup repository on the local host or in S3-compatible object storage, and the probackup restore command to perform a recovery from any backup in the repository.
The PostgreSQL pg_probackup utility for creating consistent full and incremental backups is integrated into shardman-utils, which uses the pg_probackup approach of storing backups in a pre-created repository. In addition, the pg_probackup commands archive-get and archive-push are used to deliver WAL files to the backup repository. Backup and restore modes use a passwordless SSH connection between the cluster nodes and the backup node.
The Shardman cluster configuration parameter enable_csn_snapshot must be set to on. This parameter is necessary for the cluster backup to be consistent; if it is disabled, a consistent backup is not possible.
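To verify the setting, you can query it on any cluster node, for example (psql connection options depend on your environment); the command should return on:
$ psql -c "SHOW enable_csn_snapshot;"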
For consistent visibility of distributed transactions, Shardman uses the technique of global snapshots based on physical clocks. The same technique yields a consistent snapshot for backups: the time corresponding to the global snapshot only needs to be mapped to a set of LSNs, one per node. Such a set of consistent LSNs in a cluster is called a syncpoint. By obtaining a syncpoint and taking from it the LSN for each node in the cluster, we can make a backup of each node that necessarily contains that LSN, and we can later recover to this LSN using the point-in-time recovery (PITR) mechanism.
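Conceptually, a syncpoint is just a consistent set of per-replication-group LSNs tied to one global-snapshot time. A purely illustrative JSON sketch (the field and group names here are hypothetical, not the actual storage format) might look like this:
{
  "syncpoint_csn": "1718030405000000000",
  "lsns": {
    "repgroup-1": "0/3000060",
    "repgroup-2": "0/2F000A8"
  }
}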
The backup and probackup commands use different mechanisms to create backups. The backup command is based on the standard pg_basebackup and pg_receivewal utilities, while the probackup command uses the pg_probackup utility and its options to create a cluster backup. In either case, for restoration the node names, defined by hostname or IP address, must correspond to those that were in place at the time of the backup.
2.6.1. Cluster Backup with pg_basebackup
This section describes the basics of backup and recovery in Shardman with the basebackup command.
2.6.1.1. Requirements
To back up and restore a Shardman cluster via the basebackup command, the following requirements must be met:
The Shardman cluster configuration parameter enable_csn_snapshot must be on. This parameter is necessary for the cluster backup to be consistent. If this parameter is disabled, a consistent backup is not possible.
On each Shardman cluster node, Shardman utilities must be installed into /opt/pgpro/sdm-14/bin.
On each Shardman cluster node, pg_basebackup must be installed into /opt/pgpro/sdm-14/bin.
On each Shardman cluster node, the postgres Linux user and group must be created.
Passwordless SSH connection between Shardman cluster nodes for the postgres Linux user must be configured (see the sketch after this list).
If the --use-ssh flag isn't specified, all Shardman cluster nodes must be connected to a shared network storage, and a backup folder must be created on that shared network storage.
If the --use-ssh flag is specified, the backup directory can be created on the local storage of the node where recover will be called.
Access to the backup folder must be granted to the postgres Linux user.
The shardmanctl utility must be run as the postgres Linux user.
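A minimal sketch of setting up the passwordless SSH connection and the backup directory, assuming two nodes node1 and node2 and a /mnt/backup/shardman directory (all host names and paths here are illustrative):
$ sudo -u postgres ssh-keygen -t ed25519 -N '' -f /var/lib/postgresql/.ssh/id_ed25519
$ sudo -u postgres ssh-copy-id -i /var/lib/postgresql/.ssh/id_ed25519.pub postgres@node2
$ sudo install -d -o postgres -g postgres /mnt/backup/shardman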
2.6.1.2. basebackup Backup Process
shardmanctl conducts a backup task in several steps. The tool:
Takes necessary locks in etcd to prevent concurrent cluster-wide operations.
Connects to a random replication group and locks Shardman metadata tables to prevent modification of foreign servers during the backup.
Creates replication slots on each replication group to ensure that WAL records are not lost.
Dumps Shardman metadata stored in etcd to a JSON file in the backup directory.
To get backups from each replication group, concurrently runs pg_basebackup using the replication slots created earlier.
Creates the syncpoint and uses pg_receivewal to fetch the WAL generated after each base backup finishes, until the LSNs extracted from the syncpoint are reached.
Fixes partial WAL files generated by pg_receivewal and creates the backup description file.
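With the requirements met, a backup might be started as follows; the --datadir option name is an assumption here, so check the backup command reference for the exact set of options:
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 backup --datadir /mnt/backup/shardman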
2.6.2. Cluster Recovery from a Backup Using pg_basebackup
You can restore a backup on the same or a compatible cluster. Compatible clusters are those that use the same Shardman version and have the same number of replication groups.
shardmanctl can perform either a full, a metadata-only, or a schema-only restore. A metadata-only restore is useful if issues are encountered with the etcd instance, but DBMS data is not corrupted.
During metadata-only restore, shardmanctl restores etcd data from the dump created during the backup.
Important
Restoring metadata to an incompatible cluster can lead to catastrophic consequences, including data loss, since the metadata state can differ from the actual configuration layout. Do not perform metadata-only restore if there were cluster reconfigurations after the backup, such as addition or deletion of nodes, even if the same nodes were added back again.
Schema-only restore restores only the schema information, without data. It can be useful when the amount of data is large and the schema is needed for testing or checking.
During a full restore, shardmanctl checks whether the number of replication groups in the target cluster matches the number of replication groups in the backup. This means that you cannot restore on an empty cluster, but need to add as many replication groups as necessary for the total number of them to match that of the cluster from which the backup was taken.
shardmanctl probackup restore can restore a working or partially working cluster from a backup that was created on a working or partially working cluster.
You can also restore only a single shard using the --shard parameter.
shardmanctl conducts full restore in several steps. The tool:
Takes the necessary locks in etcd to prevent concurrent cluster-wide operations and tries to assign replication groups in the backup to existing replication groups. If it cannot do this (for example, due to cluster incompatibility), the recovery fails.
Restores part of the etcd metadata: the cluster specification and parts of replication group definitions.
When the correct metadata is in place, runs stolon init in PITR initialization mode with RecoveryTargetName set to the value of the syncpoint LSN from the backup info file. DataRestoreCommand and RestoreCommand are also taken from the backup info file.
Waits for each replication group to recover.
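For example, a full restore might be launched as shown below; the --info option pointing to the backup description file is an assumption, so verify it against the recover command reference:
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 recover --info /mnt/backup/shardman/backup_info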
2.6.3. Cluster Backup with pg_probackup
This section describes the basics of backup and recovery in Shardman with the probackup command.
You can use the probackup backup command of the shardmanctl tool to perform binary backups of a Shardman cluster into the backup repository on the local (backup) host and the probackup restore command to perform a recovery from the selected backup. Full and partial (delta) backups are supported.
2.6.3.1. Requirements
To back up and restore a Shardman cluster via the probackup command, the following requirements must be met:
The Shardman cluster configuration parameter enable_csn_snapshot must be on. This parameter is necessary for the cluster backup to be consistent. If this parameter is disabled, a consistent backup is not possible.
On the backup host, Shardman utilities must be installed into /opt/pgpro/sdm-14/bin.
On the backup host and on each cluster node, pg_probackup must be installed into /opt/pgpro/sdm-14/bin.
On the backup host, the postgres Linux user and group must be created.
Passwordless SSH connection between the backup host and each Shardman cluster node for the postgres Linux user must be configured. To do this, on each node:
The postgres user must create the .ssh subdirectory in the /var/lib/postgresql directory and place there the keys required for the passwordless SSH connection.
To perform a backup/restore in a fairly large number of threads, such as 50 (-j=50, see the section called “backup” for details), MaxSessions and MaxStartups must be set to 100 for the backup host in the /etc/ssh/sshd_config file.
Note
Setting the number of threads (-j option) to a value greater than 10 for shardmanctl probackup may result in the actual number of SSH connections exceeding the maximum allowed number of simultaneous SSH connections on the backup host and consequently lead to an “ERROR: Agent error: kex_exchange_identification: Connection closed by remote host” error. To correct the error, either reduce the number of probackup threads or adjust the value of the MaxStartups configuration parameter of the backup host. If SSH is set up as an xinetd service on the backup host, adjust the value of the xinetd per_source configuration parameter rather than MaxStartups.
You can disable SSH for data copying by setting the --storage-type option to the mount or S3 value (but SSH will still be required to execute remote commands). This value will also be automatically used in the restore process.
A backup folder or a bucket in the S3-compatible object storage must be created.
Access to the backup folder must be granted to the postgres Linux user.
The shardmanctl utility must be run as the postgres Linux user.
The init subcommand for the backup repository initialization must be successfully executed on the backup host.
The archive-command add subcommand, which enables archive_command for each replication group to stream WALs into the initialized repository, must be successfully executed on the backup host (see the example after this list).
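For example, the repository initialization and WAL archiving setup might look as follows; the subcommand names come from the requirements above, while the exact option set is an assumption to check against the probackup reference:
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup init --backup-path backup_dir
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup archive-command add --backup-path backup_dir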
2.6.3.2. pg_probackup Backup Process
shardmanctl conducts a backup task in several steps. The tool:
Takes necessary locks in etcd to prevent concurrent cluster-wide operations.
Connects to a random replication group and locks Shardman metadata tables to prevent modification of foreign servers during the backup.
Dumps Shardman metadata, stored in etcd, to a JSON file in the backup directory or bucket in the S3-compatible object storage.
To get backups from each replication group, concurrently runs pg_probackup using the configured archive_command.
Creates the syncpoint and gets the LSNs for each replication group from the syncpoint data structure. Then uses the pg_probackup archive-push command to push, for each replication group, the WAL generated after the backup finishes as well as the WAL file where the syncpoint LSNs are present.
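A backup might then be taken as follows; the --backup-mode option name is an assumption modeled on the pg_probackup backup modes (full and delta), so consult the probackup backup reference for the exact options:
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup backup --backup-path backup_dir --backup-mode full -j 8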
2.6.4. Cluster Restore from a Backup with pg_probackup
You can restore a backup on the same or a compatible cluster. Compatible clusters are those that use the same Shardman version and have the same number of replication groups. You can also restore other clusters from the same backup if these clusters have the same topology.
shardmanctl can perform either a full, a metadata-only, or a schema-only restore. A metadata-only restore is useful if issues are encountered with the etcd instance, but DBMS data is not corrupted.
During metadata-only restore, shardmanctl restores etcd data from the dump created during the backup.
Important
Restoring metadata to an incompatible cluster can lead to catastrophic consequences, including data loss, since the metadata state can differ from the actual configuration layout. Do not perform metadata-only restore if there were cluster reconfigurations after the backup, such as addition or deletion of nodes, even if the same nodes were added back again.
Schema-only restore restores only the schema information, without data. It can be useful when the amount of data is large and the schema is needed for testing or checking.
During a full restore, shardmanctl checks whether the number of replication groups in the target cluster matches the number of replication groups in the backup. This means that you cannot restore on an empty cluster, but need to add as many replication groups as necessary for the total number of them to match that of the cluster from which the backup was taken.
You can also restore only a single shard using the --shard parameter.
You can also perform point-in-time recovery using the --recovery-target-time parameter. In this case, Shardman finds the syncpoint closest to the specified timestamp and suggests restoring to the LSN found. You can also specify the --wal-limit option to limit the number of WAL segments to be processed.
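For example, a point-in-time restore request might look like this (the timestamp format follows the usual PostgreSQL conventions; verify the exact option set against the probackup restore reference):
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup restore --backup-path backup_dir --recovery-target-time "2024-05-20 14:00:00" --wal-limit 100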
shardmanctl conducts full restore in several steps. The tool:
Takes the necessary locks in etcd to prevent concurrent cluster-wide operations and tries to assign replication groups in the backup to existing replication groups. If it cannot do this (for example, due to cluster incompatibility), the recovery fails.
Restores part of the etcd metadata: the cluster specification and parts of replication group definitions.
When the correct metadata is in place, runs stolon init in PITR initialization mode with RecoveryTargetName set to the value of the syncpoint LSN from the backup info file. DataRestoreCommand and RestoreCommand are also taken from the backup info file. These commands are generated automatically during the backup phase; it is not recommended to make any corrections to the file containing the Shardman cluster backup description. When restoring a cluster, the WAL files containing the final LSN to restore are requested for each replication group automatically from the backup repository on the remote backup node via the pg_probackup archive-get command.
Waits for each replication group to recover.
Finally, you need to re-enable archive_command.
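Assuming the same archive-command add subcommand used during setup also re-enables archiving (an assumption to verify against the probackup reference), this might look like:
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup archive-command add --backup-path backup_dir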
When performing a sequential restoration in PostgreSQL, be cautious of potential timeline conflicts within WAL (Write-Ahead Logging) segments. This issue commonly arises when restoring a database from a backup that was created at a certain point in time. If the database continues to operate and generate WAL segments after this backup, these new WAL segments are associated with a different timeline. During restoration, if the system tries to replay WAL segments from a different timeline - one that diverged from the point of backup - it can lead to inconsistencies and conflicts. Additionally, after completing a restoration in PostgreSQL, it is strongly advised not to restore the database onto the same timeline or onto any timeline that precedes the one from which the backup was made.
2.6.5. Merging Backups with pg_probackup
The more incremental backups are created, the bigger the total size of the backup catalog grows. To save disk space, you can merge incremental backups into their parent full backup by running the merge command and specifying the backup ID of the most recent incremental backup to merge:
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup merge --backup-path backup_dir --backup-id backup_id
This command merges the backups that belong to a common incremental backup chain. If a full backup is specified, it is merged with its first incremental backup. If an incremental backup is specified, it is merged into its parent full backup, along with all the incremental backups between them. Once the merge is complete, the full backup covers all the merged data, and the incremental backups are removed as redundant. Thus, the merge operation is virtually equivalent to removing all the outdated backups from a full backup, but much faster, especially for large data volumes. It also saves I/O and network traffic when using pg_probackup in the remote mode.
Before merging, pg_probackup validates all the affected backups to ensure that they are valid. The current backup status can be seen by running the show command:
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup show --backup-path backup_dir
For more information, see the reference.
2.6.6. Deleting Backups with pg_probackup
To delete a backup that is no longer needed, run the following command:
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup delete --backup-path backup_dir --backup-id backup_id
This command deletes the backup with the specified backup_id, along with all the incremental backups that descend from it, if any. It allows you to delete some of the recent incremental backups without affecting the underlying full backup and other incremental backups that follow it.
To delete the obsolete WAL files that are not needed for recovery, use the --delete-wal flag:
$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup delete --backup-path backup_dir --backup-id backup_id --delete-wal
For more information, see the reference.