2.6. Backup and Recovery #

This section describes the basics of backup and recovery in Shardman.

You can use the backup command of the shardmanctl tool to take a full binary consistent backup of a Shardman cluster to a shared directory (or to a local directory if --use-ssh is specified) and the recover command to restore the cluster from this backup.
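For example, a full backup to a shared directory and a subsequent restore from it might look as follows (the etcd endpoint and paths here are illustrative, and the exact option spellings should be checked against the shardmanctl reference):

$ shardmanctl --store-endpoints http://etcd1:2379 backup --datadir /mnt/backups
$ shardmanctl --store-endpoints http://etcd1:2379 recover --info /mnt/backups/backup_info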

You can also use the probackup backup command of the shardmanctl tool to take a full binary consistent backup of a Shardman cluster to a backup repository on the local host or in S3-compatible object storage, and the probackup restore command to restore the cluster from any backup in the repository.

The pg_probackup utility, a PostgreSQL tool for creating consistent full and incremental backups, is integrated into shardman-utils. shardman-utils follows the pg_probackup approach of storing backups in a pre-created repository. In addition, the pg_probackup commands archive-get and archive-push are used to deliver WAL files to the backup repository. Backup and restore modes use a passwordless SSH connection between the cluster nodes and the backup node.

The Shardman cluster configuration parameter enable_csn_snapshot must be set to on. This parameter is required for the cluster backup to be consistent; if it is disabled, a consistent backup is not possible.
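To verify the setting, you can run the standard SHOW command on any cluster node, for example:

$ psql -c 'SHOW enable_csn_snapshot;'

The command must return on before you take a backup.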

For consistent visibility of distributed transactions, Shardman uses global snapshots based on physical clocks. A consistent snapshot for backups can be obtained in a similar way, except that the time corresponding to the global snapshot must be mapped to a set of LSNs, one per node. Such a set of consistent LSNs in a cluster is called a syncpoint. By taking the LSN of each node in the cluster from the syncpoint, we can make a backup of each node that is guaranteed to contain that LSN. We can also recover each node to its LSN using the point-in-time recovery (PITR) mechanism.
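For illustration only (this is a conceptual sketch, not the actual on-disk format), a syncpoint can be thought of as a per-shard map of LSNs:

shard-1: 0/30000D8
shard-2: 0/2FA00A0
shard-3: 0/31000B0

A backup of each shard must contain that shard's LSN from the map, and PITR can later replay each shard exactly to that LSN, producing a cluster-wide consistent state.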

The backup and probackup commands use different mechanisms to create backups. The backup command is based on the standard pg_basebackup and pg_receivewal utilities. The probackup command uses the pg_probackup utility and its options to create a cluster backup. In either case, at restore time the node names, defined by hostname or IP address, must match those that were in place at the time of the backup.

2.6.1. Cluster Backup with pg_basebackup #

This section describes the basics of backup and recovery in Shardman with the basebackup command.

2.6.1.1. Requirements #

To back up and restore a Shardman cluster via the basebackup command, the following requirements must be met:

  • The Shardman cluster configuration parameter enable_csn_snapshot must be on. This parameter is required for the cluster backup to be consistent; if it is disabled, a consistent backup is not possible.

  • On each Shardman cluster node, Shardman utilities must be installed into /opt/pgpro/sdm-14/bin.

  • On each Shardman cluster node, pg_basebackup must be installed into /opt/pgpro/sdm-14/bin.

  • On each Shardman cluster node, the postgres Linux user and group must be created.

  • Passwordless SSH connection between Shardman cluster nodes must be configured for the postgres Linux user (see the example after this list).

  • If the --use-ssh flag is not specified, all Shardman cluster nodes must be connected to shared network storage, and a backup folder must be created on that storage.

  • If the --use-ssh flag is specified, the backup directory can be created on local storage of the node where recover will be called.

  • Access to the backup folder must be granted to the postgres Linux user.

  • The shardmanctl utility must be run as the postgres Linux user.
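For example, passwordless SSH for the postgres user can be configured with the standard OpenSSH tools (the node name here is illustrative):

$ sudo -u postgres ssh-keygen -t ed25519 -N '' -f /var/lib/postgresql/.ssh/id_ed25519
$ sudo -u postgres ssh-copy-id -i /var/lib/postgresql/.ssh/id_ed25519.pub postgres@node2

Repeat ssh-copy-id for every other cluster node.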

2.6.1.2. basebackup Backup Process #

shardmanctl conducts a backup task in several steps. The tool:

  1. Takes necessary locks in etcd to prevent concurrent cluster-wide operations.

  2. Connects to a random replication group and locks Shardman metadata tables to prevent modification of foreign servers during the backup.

  3. Creates replication slots on each replication group to ensure that WAL records are not lost.

  4. Dumps Shardman metadata stored in etcd to a JSON file in the backup directory.

  5. To get a backup of each replication group, concurrently runs pg_basebackup using the created replication slots.

  6. Creates a syncpoint and uses pg_receivewal to fetch the WAL generated after each base backup finishes, until the LSNs extracted from the syncpoint are reached.

  7. Finalizes the partial WAL files generated by pg_receivewal and creates the backup description file.

2.6.2. Cluster Recovery from a Backup Using pg_basebackup #

You can restore a backup on the same or a compatible cluster. Compatible clusters are those that use the same Shardman version and have the same number of replication groups.

shardmanctl can perform a full, metadata-only, or schema-only restore. A metadata-only restore is useful if issues are encountered with the etcd instance, but the DBMS data is not corrupted.

During metadata-only restore, shardmanctl restores etcd data from the dump created during the backup.
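A metadata-only restore might look as follows (the path to the backup description file is illustrative, and the exact option spellings should be checked against the shardmanctl reference):

$ shardmanctl recover --info /mnt/backups/backup_info --metadata-only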

Important

Restoring metadata to an incompatible cluster can lead to catastrophic consequences, including data loss, since the metadata state can differ from the actual configuration layout. Do not perform metadata-only restore if there were cluster reconfigurations after the backup, such as addition or deletion of nodes, even if the same nodes were added back again.

Schema-only restore recovers only the schema, without data. It can be useful when the data volume is large and only the schema is needed for testing or verification.

During a full restore, shardmanctl checks whether the number of replication groups in the target cluster matches the number of replication groups in the backup. This means that you cannot restore to an empty cluster: you need to add as many replication groups as necessary for their total number to match that of the cluster from which the backup was taken.

shardmanctl recover can restore a working or partially working cluster from a backup that was created on a working or partially working cluster.

You can also restore only a single shard, using the --shard parameter.
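For example, a single-shard restore might look as follows (the path and shard name are illustrative, and the exact option spellings should be checked against the shardmanctl reference):

$ shardmanctl recover --info /mnt/backups/backup_info --shard shard-1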

shardmanctl conducts full restore in several steps. The tool:

  1. Takes the necessary locks in etcd to prevent concurrent cluster-wide operations and tries to assign replication groups in the backup to existing replication groups. If it cannot do this (for example, due to cluster incompatibility), the recovery fails.

  2. Restores part of the etcd metadata: the cluster specification and parts of replication group definitions.

  3. When the correct metadata is in place, runs stolon init in PITR initialization mode with RecoveryTargetName set to the value of the syncpoint LSN from the backup info file. DataRestoreCommand and RestoreCommand are also taken from the backup info file.

  4. Waits for each replication group to recover.

2.6.3. Cluster Backup with pg_probackup #

This section describes the basics of backup and recovery in Shardman with the probackup command.

You can use the probackup backup command of the shardmanctl tool to take binary backups of a Shardman cluster to the backup repository on the local (backup) host, and the probackup restore command to restore the cluster from a selected backup. Full and partial (delta) backups are supported.
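For example, a full backup followed later by a delta backup might be taken as follows (the etcd endpoint, repository path, and mode spellings are illustrative; see the probackup reference for the exact syntax):

$ shardmanctl --store-endpoints http://etcd1:2379 probackup backup --backup-path /var/backup/shardman --backup-mode FULL
$ shardmanctl --store-endpoints http://etcd1:2379 probackup backup --backup-path /var/backup/shardman --backup-mode DELTA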

2.6.3.1. Requirements #

To back up and restore a Shardman cluster via the probackup command, the following requirements must be met:

  • The Shardman cluster configuration parameter enable_csn_snapshot must be on. This parameter is required for the cluster backup to be consistent; if it is disabled, a consistent backup is not possible.

  • On the backup host, Shardman utilities must be installed into /opt/pgpro/sdm-14/bin.

  • On the backup host and on each cluster node, pg_probackup must be installed into /opt/pgpro/sdm-14/bin.

  • On the backup host, the postgres Linux user and group must be created.

  • Passwordless SSH connection between the backup host and each Shardman cluster node must be configured for the postgres Linux user. To do this, on each node:

    • The postgres user must create the .ssh subdirectory in the /var/lib/postgresql directory and place the keys required for the passwordless SSH connection there.

    • To perform a backup/restore with a large number of threads, such as 50 (-j 50; see the backup reference for details), MaxSessions and MaxStartups must be set to 100 for the backup host in the /etc/ssh/sshd_config file.

      Note

      Setting the number of threads (the -j option) of shardmanctl probackup to a value greater than 10 may result in the actual number of SSH connections exceeding the maximum allowed number of simultaneous SSH connections on the backup host, which leads to the error ERROR: Agent error: kex_exchange_identification: Connection closed by remote host. To correct the error, either reduce the number of probackup threads or adjust the value of the MaxStartups configuration parameter on the backup host. If SSH is set up as an xinetd service on the backup host, adjust the value of the xinetd per_source configuration parameter rather than MaxStartups.

    You can disable SSH for data copying by setting the --storage-type option to mount or S3 (SSH will still be required to execute remote commands). The same storage type is then automatically used in the restore process.

  • A backup folder or bucket in the S3-compatible object storage must be created.

  • Access to the backup folder must be granted to the postgres Linux user.

  • The shardmanctl utility must be run as the postgres Linux user.

  • The init subcommand, which initializes the backup repository, must be successfully executed on the backup host.

  • The archive-command add subcommand, which enables archive_command for each replication group to stream WAL into the initialized repository, must be successfully executed on the backup host (see the example after this list).
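For example, the repository initialization and WAL archiving setup mentioned in the last two items might look as follows (the etcd endpoint and repository path are illustrative):

$ shardmanctl --store-endpoints http://etcd1:2379 probackup init --backup-path /var/backup/shardman
$ shardmanctl --store-endpoints http://etcd1:2379 probackup archive-command add --backup-path /var/backup/shardman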

2.6.3.2. pg_probackup Backup Process #

shardmanctl conducts a backup task in several steps. The tool:

  1. Takes necessary locks in etcd to prevent concurrent cluster-wide operations.

  2. Connects to a random replication group and locks Shardman metadata tables to prevent modification of foreign servers during the backup.

  3. Dumps Shardman metadata, stored in etcd, to a JSON file in the backup directory or bucket in the S3-compatible object storage.

  4. To get backups from each replication group, concurrently runs pg_probackup using the configured archive_command.

  5. Creates a syncpoint and gets the LSN for each replication group from the syncpoint data structure. Then uses the pg_probackup archive-push command to push, for each replication group, the WAL generated after the backup finishes, including the WAL file that contains the syncpoint LSN.

2.6.4. Cluster Restore from a Backup with pg_probackup #

You can restore a backup on the same or a compatible cluster. Compatible clusters are those that use the same Shardman version and have the same number of replication groups.

You can also restore other clusters from the same backup if those clusters have the same topology.

shardmanctl can perform a full, metadata-only, or schema-only restore. A metadata-only restore is useful if issues are encountered with the etcd instance, but the DBMS data is not corrupted.

During metadata-only restore, shardmanctl restores etcd data from the dump created during the backup.

Important

Restoring metadata to an incompatible cluster can lead to catastrophic consequences, including data loss, since the metadata state can differ from the actual configuration layout. Do not perform metadata-only restore if there were cluster reconfigurations after the backup, such as addition or deletion of nodes, even if the same nodes were added back again.

Schema-only restore recovers only the schema, without data. It can be useful when the data volume is large and only the schema is needed for testing or verification.

During a full restore, shardmanctl checks whether the number of replication groups in the target cluster matches the number of replication groups in the backup. This means that you cannot restore to an empty cluster: you need to add as many replication groups as necessary for their total number to match that of the cluster from which the backup was taken.

You can also restore only a single shard, using the --shard parameter.

You can also perform point-in-time recovery using the --recovery-target-time parameter. In this case, Shardman finds the syncpoint closest to the specified timestamp and suggests restoring to the LSN found. You can also specify the --wal-limit option to limit the number of WAL segments to be processed.
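For example, a point-in-time restore might look as follows (the timestamp, repository path, and etcd endpoint are illustrative):

$ shardmanctl --store-endpoints http://etcd1:2379 probackup restore --backup-path /var/backup/shardman --recovery-target-time '2024-01-15 12:00:00+03' --wal-limit 100

Shardman then reports the closest syncpoint it has found and suggests restoring to the corresponding LSN.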

shardmanctl conducts full restore in several steps. The tool:

  1. Takes the necessary locks in etcd to prevent concurrent cluster-wide operations and tries to assign replication groups in the backup to existing replication groups. If it cannot do this (for example, due to cluster incompatibility), the recovery fails.

  2. Restores part of the etcd metadata: the cluster specification and parts of replication group definitions.

  3. When the correct metadata is in place, runs stolon init in PITR initialization mode with RecoveryTargetName set to the value of the syncpoint LSN from the backup info file. DataRestoreCommand and RestoreCommand are also taken from the backup info file. These commands are generated automatically during the backup phase; do not edit the file containing the Shardman cluster backup description. When the cluster is restored, for each replication group the WAL files containing the final LSN to restore to are requested automatically from the backup repository on the remote backup node via the pg_probackup archive-get command.

  4. Waits for each replication group to recover.

  5. Finally, archive_command must be enabled again (see the archive-command add subcommand).

When performing a sequential restoration, be cautious of potential timeline conflicts within WAL (write-ahead logging) segments. This issue commonly arises when a database is restored from a backup created at a certain point in time: if the database continues to operate and generate WAL after that backup, the new WAL segments belong to a different timeline. During restoration, if the system tries to replay WAL segments from a timeline that diverged from the point of backup, this can lead to inconsistencies and conflicts. Additionally, after completing a restoration, it is strongly advised not to restore the database onto the same timeline, or onto any timeline that precedes the one from which the backup was made.

2.6.5. Merging Backups with pg_probackup #

The more incremental backups are created, the bigger the total size of the backup catalog grows. To save disk space, you can merge incremental backups into their parent full backup by running the merge command and specifying the backup ID of the most recent incremental backup to merge:

$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup merge --backup-path backup_dir --backup-id backup_id

This command merges the backups that belong to a common incremental backup chain. If a full backup is specified, it is merged with its first incremental backup. If an incremental backup is specified, it is merged into its parent full backup, along with all the incremental backups between them. Once the merge is complete, the full backup covers all the merged data, and the incremental backups are removed as redundant. Thus, the merge operation is virtually equivalent to removing all the outdated backups, but is much faster, especially for large data volumes. It also saves I/O and network traffic when pg_probackup is used in remote mode.

Before merging, pg_probackup validates all the affected backups to ensure that they are valid. The current backup status can be seen by running the show command:

$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup show --backup-path backup_dir

For more information, see the shardmanctl probackup reference.

2.6.6. Deleting Backups with pg_probackup #

To delete a backup that is no longer needed, run the following command:

$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup delete --backup-path backup_dir --backup-id backup_id

This command deletes the backup with the specified backup_id, together with all the incremental backups that descend from it, if any. This way you can delete some of the recent incremental backups without affecting the underlying full backup and the other incremental backups based on it.

To delete the obsolete WAL files that are not needed for recovery, use the --delete-wal flag:

$ shardmanctl --store-endpoints http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 probackup delete --backup-path backup_dir --backup-id backup_id --delete-wal

For more information, see the shardmanctl probackup reference.
