Thread: Backups using non-consistent snapshots (e.g. COW reflinks)
Dear fellow PG admins,
Every day we take a basebackup of our main database, which is about 1.5 TB. We store the backup on the same volume, so that it’s easily available if we need to do an emergency recovery (e.g. we can flip the directories and do a point-in-time restore), and we also tar, compress and encrypt it for archival (and then restore that backup to a testing environment to make sure that our backups actually work).
The database volume is on a networked block device (similar to iSCSI) which deduplicates the data. The I/O from taking this basebackup and rewriting the local backup directory is quite heavy on network, storage device CPU and disks, and our IT have asked us to reduce it. While we could do a basebackup in tar format and stream it through compression and encryption, we’d lose the ability to restore it quickly. Snapshots on the block device are an option, but they’re controlled by IT, and we don’t have direct access to them or the ability to automatically create or mount new snapshots.
Another option, since the database volume uses XFS, is taking a reflink copy (COW) onto the same filesystem, which is nearly instantaneous (rm -rf $backup && cp -a –reflink $PGDATA $backup) and avoids actually copying any data. This is however not a consistent snapshot. The manual says:
“the physical backup doesn't have to be an instantaneous snapshot of the database state — if it is made over some period of time, then replaying the WAL for that period will fix any internal inconsistencies.”
However it’s not clear to me exactly what the requirements for this are – especially how Postgres knows where to start replaying the WAL from. (That may not be the only risk of taking backups this way, but it’s the only one that I can think of apart from bugs in the filesystem and cp. I am assuming that all required WAL segments will be available as we keep them in the archive for much longer than the checkpoint interval).
The manual also says:
“After a checkpoint has been made and the WAL flushed, the checkpoint's position is saved in the file pg_control. Therefore, at the start of recovery, the server first reads pg_control and then the checkpoint record; then it performs the REDO operation by scanning forward from the WAL location indicated in the checkpoint record.” (emphasis mine)
So if I save a copy of pg_control first (before the backup starts) and then place it inside the backup directory at the end, that might be enough to ensure that recovery cover the entire period while the backup was running. But what exactly is this “checkpoint record” – is it in the WAL? And if so, what is the location that it indicates? Is the record written at the end of the checkpoint, and it points back to the WAL location at the beginning of the checkpoint?
Or is it enough to run pg_start_backup before taking the copy? Will this prevent any additional checkpoints from being taken (once the function returns) until pg_stop_backup is called? Are there any other risks that I’m missing?
Thanks in advance for your help and advice. I will try to submit updates to the manual with what I learn.
Thanks, Chris.
Greetings, * Wilson, Chris (Chris.Wilson@gam.com) wrote: > Every day we take a basebackup of our main database, which is about 1.5 TB. We store the backup on the same volume, sothat it's easily available if we need to do an emergency recovery (e.g. we can flip the directories and do a point-in-timerestore), and we also tar, compress and encrypt it for archival (and then restore that backup to a testing environmentto make sure that our backups actually work). pg_basebackup is a good tool for doing one-off backups of smaller systems but for larger environments you should seriously consider using tools which are specifically developed and designed for PG backups, such as pgbackrest. Glad to hear that you test your backup restore process. Not sure what all is involved in that based on your above comment but I'd strongly encourage doing things like: perform the restore, bring PG up and have it replay the WAL to some point in time, then use tools like amcheck to validate indexes and pg_dump/pg_restore to a newly created system to ensure all of the heap pages are able to be read and that all constraints in the DB are still valid. Bonus- the pg_dump output can be kept around to provide a quick way to restore an individual table. > The database volume is on a networked block device (similar to iSCSI) which deduplicates the data. The I/O from takingthis basebackup and rewriting the local backup directory is quite heavy on network, storage device CPU and disks, andour IT have asked us to reduce it. While we could do a basebackup in tar format and stream it through compression andencryption, we'd lose the ability to restore it quickly. Snapshots on the block device are an option, but they're controlledby IT, and we don't have direct access to them or the ability to automatically create or mount new snapshots. pgbackrest supports differential and incremental backups which could be taken instead of always doing full backups as you're doing with pg_basebackup. You should still regularly do full backups though, of course. > Another option, since the database volume uses XFS, is taking a reflink copy (COW) onto the same filesystem, which is nearlyinstantaneous (rm -rf $backup && cp -a -reflink $PGDATA $backup) and avoids actually copying any data. This is howevernot a consistent snapshot. The manual<https://www.postgresql.org/docs/16/wal-intro.html> says: > > "the physical backup doesn't have to be an instantaneous snapshot of the database state - if it is made over some periodof time, then replaying the WAL for that period will fix any internal inconsistencies." This is true when you're making sure to use the low-level backup system provided by PG- see link below for documentation on that. > However it's not clear to me exactly what the requirements for this are - especially how Postgres knows where to startreplaying the WAL from. (That may not be the only risk of taking backups this way, but it's the only one that I canthink of apart from bugs in the filesystem and cp. I am assuming that all required WAL segments will be available as wekeep them in the archive for much longer than the checkpoint interval). > > The manual also says<https://www.postgresql.org/docs/16/wal-internals.html>: > > "After a checkpoint has been made and the WAL flushed, the checkpoint's position is saved in the file pg_control. Therefore,at the start of recovery, the server first reads pg_control and then the checkpoint record; then it performs theREDO operation by scanning forward from the WAL location indicated in the checkpoint record." (emphasis mine) > > So if I save a copy of pg_control first (before the backup starts) and then place it inside the backup directory at theend, that might be enough to ensure that recovery cover the entire period while the backup was running. But what exactlyis this "checkpoint record" - is it in the WAL? And if so, what is the location that it indicates? Is the record writtenat the end of the checkpoint, and it points back to the WAL location at the beginning of the checkpoint? Do not, do not, do not try to hack up your own process for backup and restore- read the PG docs on how to do this. There's far too many ways to get a copy of the data directory that *looks* like a valid backup and may even be valid in many cases, except that one time you need to actually restore you discover that the process wasn't correct and you ended up without a valid backup. Even better- use a tool that already exists and does all of this correctly for you instead of trying to develop your own. > Or is it enough to run pg_start_backup before taking the copy? Will this prevent any additional checkpoints from beingtaken (once the function returns) until pg_stop_backup is called? Are there any other risks that I'm missing? If you're doing a backup of PG, you need to use pg_backup_start/stop (or pg_start_backup/pg_stop_backup in older versions) and you need to make sure to store the backup_label (returned from pg_backup_stop) file with the backup- that's what tells PG where it needs to start WAL replay from. You also need to fully read all of the documentation on this here: https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-LOWLEVEL-BASE-BACKUP and possibly even get into the PG source code if you want to develop a serious backup tool for PG (eg: there is zero point copying/backing up UNLOGGED tables but we don't discuss things like this in the backup documentation nor how to figure out which files are associated with UNLOGGED tables and which aren't...). > Thanks in advance for your help and advice. I will try to submit updates to the manual with what I learn. Certainly would appreciate any suggestions you have for improving the manual. Thanks, Stephen