Thread: Backups using non-consistent snapshots (e.g. COW reflinks)

Backups using non-consistent snapshots (e.g. COW reflinks)

From

"Wilson, Chris"

Date:

18 December 2023, 12:51:04

Dear fellow PG admins,

Every day we take a basebackup of our main database, which is about 1.5 TB. We store the backup on the same volume, so that it’s easily available if we need to do an emergency recovery (e.g. we can flip the directories and do a point-in-time restore), and we also tar, compress and encrypt it for archival (and then restore that backup to a testing environment to make sure that our backups actually work).

The database volume is on a networked block device (similar to iSCSI) which deduplicates the data. The I/O from taking this basebackup and rewriting the local backup directory is quite heavy on network, storage device CPU and disks, and our IT have asked us to reduce it. While we could do a basebackup in tar format and stream it through compression and encryption, we’d lose the ability to restore it quickly. Snapshots on the block device are an option, but they’re controlled by IT, and we don’t have direct access to them or the ability to automatically create or mount new snapshots.

Another option, since the database volume uses XFS, is taking a reflink copy (COW) onto the same filesystem, which is nearly instantaneous (rm -rf $backup && cp -a –reflink $PGDATA $backup) and avoids actually copying any data. This is however not a consistent snapshot. The manual says:

“the physical backup doesn't have to be an instantaneous snapshot of the database state — if it is made over some period of time, then replaying the WAL for that period will fix any internal inconsistencies.”

However it’s not clear to me exactly what the requirements for this are – especially how Postgres knows where to start replaying the WAL from. (That may not be the only risk of taking backups this way, but it’s the only one that I can think of apart from bugs in the filesystem and cp. I am assuming that all required WAL segments will be available as we keep them in the archive for much longer than the checkpoint interval).

The manual also says:

“After a checkpoint has been made and the WAL flushed, the checkpoint's position is saved in the file pg_control. Therefore, at the start of recovery, the server first reads pg_control and then the checkpoint record; then it performs the REDO operation by scanning forward from the WAL location indicated in the checkpoint record.” (emphasis mine)

So if I save a copy of pg_control first (before the backup starts) and then place it inside the backup directory at the end, that might be enough to ensure that recovery cover the entire period while the backup was running. But what exactly is this “checkpoint record” – is it in the WAL? And if so, what is the location that it indicates? Is the record written at the end of the checkpoint, and it points back to the WAL location at the beginning of the checkpoint?

Or is it enough to run pg_start_backup before taking the copy? Will this prevent any additional checkpoints from being taken (once the function returns) until pg_stop_backup is called? Are there any other risks that I’m missing?

Thanks in advance for your help and advice. I will try to submit updates to the manual with what I learn.

Thanks, Chris.

This email was sent by and on behalf of GAM Investments. GAM Investments is the corporate brand for GAM Holding AG and its direct and indirect subsidiaries. These companies may be referred to as ‘GAM’ or ‘GAM Investments’. In the United Kingdom, the business of GAM Investments is conducted by GAM (U.K.) Limited (No. 01664573) or one or more entities under the control of GAM (U.K.) Limited, including the following entities authorised and regulated by the Financial Conduct Authority: GAM International Management Limited (No. 01802911), GAM London Limited (No. 00874802), GAM Sterling Management Limited (No. 01750352), GAM Unit Trust Management Company Limited (No. 2873560) and GAM Systematic LLP (No. OC317557). GAM (U.K.) Limited and its regulated entities are registered in England and Wales. The registered office and principal place of business of GAM (U.K.) Limited and its regulated entities is at 8 Finsbury Circus, London, England, EC2M 7GB. The registered office of GAM Systematic LLP is at City House, Hills Road, Cambridge, CB2 1RE. This email, and any attachments, is confidential and may be privileged or otherwise protected from disclosure. It is intended solely for the stated addressee(s) and access to it by any other person is unauthorised. If you are not the intended recipient, you must not disclose, copy, circulate or in any other way use or rely on the information contained herein. If you have received this email in error, please inform us immediately and delete all copies of it. See - https://www.gam.com/en/legal/email-disclosures-eu/ for further information on confidentiality, the risks of non-secure electronic communication, and certain disclosures which we are required to make in accordance with applicable legislation and regulations. If you cannot access this link, please notify us by reply message and we will send the contents to you. GAM Investments will collect and use information about you in the course of your interactions with us. Full details about the data types we collect and what we use this for and your related rights is set out in our online privacy policy at https://www.gam.com/en/legal/privacy-policy. Please familiarise yourself with this policy and check it from time to time for updates as it supplements this notice.

Re: Backups using non-consistent snapshots (e.g. COW reflinks)

From

Stephen Frost

Date:

18 December 2023, 14:12:25

Greetings,

* Wilson, Chris (Chris.Wilson@gam.com) wrote:
> Every day we take a basebackup of our main database, which is about 1.5 TB. We store the backup on the same volume,
sothat it's easily available if we need to do an emergency recovery (e.g. we can flip the directories and do a
point-in-timerestore), and we also tar, compress and encrypt it for archival (and then restore that backup to a testing
environmentto make sure that our backups actually work). 

pg_basebackup is a good tool for doing one-off backups of smaller
systems but for larger environments you should seriously consider using
tools which are specifically developed and designed for PG backups, such
as pgbackrest.

Glad to hear that you test your backup restore process.  Not sure what
all is involved in that based on your above comment but I'd strongly
encourage doing things like: perform the restore, bring PG up and have
it replay the WAL to some point in time, then use tools like amcheck to
validate indexes and pg_dump/pg_restore to a newly created system to
ensure all of the heap pages are able to be read and that all
constraints in the DB are still valid.  Bonus- the pg_dump output can
be kept around to provide a quick way to restore an individual table.

> The database volume is on a networked block device (similar to iSCSI) which deduplicates the data. The I/O from
takingthis basebackup and rewriting the local backup directory is quite heavy on network, storage device CPU and disks,
andour IT have asked us to reduce it. While we could do a basebackup in tar format and stream it through compression
andencryption, we'd lose the ability to restore it quickly. Snapshots on the block device are an option, but they're
controlledby IT, and we don't have direct access to them or the ability to automatically create or mount new snapshots. 

pgbackrest supports differential and incremental backups which could be
taken instead of always doing full backups as you're doing with
pg_basebackup.  You should still regularly do full backups though, of
course.

> Another option, since the database volume uses XFS, is taking a reflink copy (COW) onto the same filesystem, which is
nearlyinstantaneous (rm -rf $backup && cp -a -reflink $PGDATA $backup) and avoids actually copying any data. This is
howevernot a consistent snapshot. The manual<https://www.postgresql.org/docs/16/wal-intro.html> says: 
>
> "the physical backup doesn't have to be an instantaneous snapshot of the database state - if it is made over some
periodof time, then replaying the WAL for that period will fix any internal inconsistencies." 

This is true when you're making sure to use the low-level backup system
provided by PG- see link below for documentation on that.

> However it's not clear to me exactly what the requirements for this are - especially how Postgres knows where to
startreplaying the WAL from. (That may not be the only risk of taking backups this way, but it's the only one that I
canthink of apart from bugs in the filesystem and cp. I am assuming that all required WAL segments will be available as
wekeep them in the archive for much longer than the checkpoint interval). 
>
> The manual also says<https://www.postgresql.org/docs/16/wal-internals.html>:
>
> "After a checkpoint has been made and the WAL flushed, the checkpoint's position is saved in the file pg_control.
Therefore,at the start of recovery, the server first reads pg_control and then the checkpoint record; then it performs
theREDO operation by scanning forward from the WAL location indicated in the checkpoint record." (emphasis mine) 
>
> So if I save a copy of pg_control first (before the backup starts) and then place it inside the backup directory at
theend, that might be enough to ensure that recovery cover the entire period while the backup was running. But what
exactlyis this "checkpoint record" - is it in the WAL? And if so, what is the location that it indicates? Is the record
writtenat the end of the checkpoint, and it points back to the WAL location at the beginning of the checkpoint? 

Do not, do not, do not try to hack up your own process for backup and
restore- read the PG docs on how to do this.  There's far too many ways
to get a copy of the data directory that *looks* like a valid backup and
may even be valid in many cases, except that one time you need to
actually restore you discover that the process wasn't correct and you
ended up without a valid backup.

Even better- use a tool that already exists and does all of this
correctly for you instead of trying to develop your own.

> Or is it enough to run pg_start_backup before taking the copy? Will this prevent any additional checkpoints from
beingtaken (once the function returns) until pg_stop_backup is called? Are there any other risks that I'm missing? 

If you're doing a backup of PG, you need to use pg_backup_start/stop (or
pg_start_backup/pg_stop_backup in older versions) and you need to make
sure to store the backup_label (returned from pg_backup_stop) file with
the backup- that's what tells PG where it needs to start WAL replay
from.

You also need to fully read all of the documentation on this here:

https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-LOWLEVEL-BASE-BACKUP

and possibly even get into the PG source code if you want to develop a
serious backup tool for PG (eg: there is zero point copying/backing up
UNLOGGED tables but we don't discuss things like this in the backup
documentation nor how to figure out which files are associated with
UNLOGGED tables and which aren't...).

> Thanks in advance for your help and advice. I will try to submit updates to the manual with what I learn.

Certainly would appreciate any suggestions you have for improving the
manual.

Thanks,

Stephen

Attachment

signature.asc