Re: Backup "Best Practices" - Mailing list pgsql-general

From: Israel Brewster
Subject: Re: Backup "Best Practices"
Date:
Msg-id: 3F8B94BA-467F-4CE6-80EF-C74EBC2AB3EF@ravnalaska.net
In response to: Re: Backup "Best Practices" (Jeff Janes <jeff.janes@gmail.com>)
Responses: Re: Backup "Best Practices" (John R Pierce <pierce@hogranch.com>)
           Re: Backup "Best Practices" (Jeff Janes <jeff.janes@gmail.com>)
           Re: Backup "Best Practices" (Thomas Kellerer <spam_eater@gmx.net>)
List: pgsql-general
On Nov 25, 2016, at 1:00 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Wed, Nov 23, 2016 at 10:16 AM, Israel Brewster <israel@ravnalaska.net> wrote:
I was wondering if someone could inform me about, or point me to an online article about, best practices for backing up a PostgreSQL database cluster? At the moment, I have two servers running with streaming replication for failover purposes, and then I do nightly database dumps for recovery/backup purposes. However, I feel this approach is lacking in a couple of areas I can think of:

- If a total failure occurs towards the end of the day, we could potentially lose a whole day's worth of data.

Why wouldn't the streaming replica salvage that?  Are they expected to fail together?  Is the NFS share on which you want to store your basebackup and WAL also expected to fail together with them?

That's why I specified *total* failure. If only the primary dies, that's a simple cutover to the secondary, and not a total failure - no need to go to backups of any sort under that scenario :-) I'm thinking more along the lines of the server room burning down, a natural disaster, etc. - something that causes a total failure of the db system, necessitating recovery from an offsite backup.

 
Similar argument for user error - there is no way to "undo" a catastrophic user data error without going all the way back to the previous day.
- Less-than-ideal recovery under some scenarios. Since each database in the cluster is dumped individually, this is fine if only *one* database needs to be restored, but it could get tedious should the entire cluster need to be restored.
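(For context, the nightly dump job referred to above is essentially a loop of per-database pg_dump calls along these lines - paths and options here are illustrative, not the exact script:)

for db in $(psql -At -c "SELECT datname FROM pg_database WHERE NOT datistemplate"); do
    pg_dump -Fc -f /backups/nightly/"$db".dump "$db"
done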

To mitigate these issues, I am thinking of supplementing the individual dumps with a full base backup and WAL archiving to an NFS share. This should enable (relatively) quick/easy recovery from backup, plus the ability to do point-in-time recovery (PITR). I do have a few questions about this approach, however:
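(For concreteness, the archiving half of that plan would be something like the following on the primary - the NFS mount point is a placeholder, and the test/cp pattern is the one from the docs; the questions below assume roughly this setup:)

# postgresql.conf (archive_mode requires a server restart to take effect)
archive_mode = on
archive_command = 'test ! -f /mnt/nfs/wal_archive/%f && cp %p /mnt/nfs/wal_archive/%f'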

- How do I figure out how often I should take a full base backup? I know this will depend on the amount of traffic my database is doing, and how long I am willing to wait for WAL files to be replayed - the more WAL files that need to be replayed, the longer recovery will take - but is there some rule of thumb I can use to calculate how often I need a new base backup? Perhaps based on the number of WAL files?

You have to try it and see.  Different types of WAL records take different amounts of time to replay, so there is no rule of thumb; it depends on the type of traffic you have in your database.  Replay could also be limited by a single CPU, or by IO.  If the restore_command needs to restore the WAL from a remote server, it is very likely to be limited by the latency of doing that.  In fact, this is often the bottleneck even when restoring from the local server, at least when archiving is mostly driven by archive_timeout.

When I need to re-clone production to get a fresh server to use for dev or testing, I do so using almost exactly the same method I would use for restoring production after a disaster (restore from the most recent basebackup, then recover from the WAL archive).  So I know from actual experience how long the recovery takes, and I take a new basebackup when that length of time starts to annoy me.

Gotcha. Guess I'll have to start running some tests :-)
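(For reference, the sort of test restore I'm picturing is roughly the following - paths, version and timestamp are placeholders, not a tested procedure:)

# on a scratch server, unpack the most recent base backup into an empty data directory
tar -xzf /mnt/nfs/basebackups/20161123/base.tar.gz -C /var/lib/pgsql/9.6/data

# recovery.conf in that data directory
restore_command = 'cp /mnt/nfs/wal_archive/%f "%p"'
# optionally stop at a specific point rather than replaying all available WAL:
# recovery_target_time = '2016-11-23 16:30:00'

# then start the server and let it replay WAL up to the target (or the end of the archive)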

 
- What is the "best" (or just a good) method of keeping the WAL archives under control? Obviously when I do a new basebackup I can "clean up" any old files that said backup doesn't need,

You have said you might be interested in doing PITR, so you want to delay the cleanup so as not to compromise that ability.  You need to develop a policy on how far back you want to be able to do a PITR.

 
but how do I know what those are?

pg_archivecleanup -n /mnt/server/archiverdir 000000010000000000000010.00000020.backup

OK, but where does that "000000010000000000000010.00000020.backup" come from? I mean, I can tell it's a WAL segment file name (plus a backup label), but I don't have anything like that in my WAL archives, even though I've run pg_basebackup a couple of times. Do I have to call something to create that file? Some flag to pg_basebackup? At the moment I am running pg_basebackup such that it generates gzipped tar files, if that makes a difference.
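(Specifically, the invocation is along the lines of the following - host and target path are placeholders:)

pg_basebackup -h localhost -D /mnt/nfs/basebackups/$(date +%Y%m%d) -Ft -z -P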


 
- Should I be looking at any other backup methods in addition to/instead of the basebackup/WAL archive scheme?


You may want to consider pg_receivexlog to maintain your WAL archive, rather than archive_command.  That way you don't have to worry about the trade-offs caused by setting archive_timeout.  But unless you use it with a replication slot, it is not very safe, as pg_receivexlog could stop working and your database would happily run along without protection.  Also, it is hard to be sure you are reliably issuing fsyncs over NFS, so with archive_command over NFS there is always the risk that your WAL data is not actually reaching disk in a timely fashion.  So if you can run pg_receivexlog on the NFS-host machine, pointed at its local storage rather than looping back over NFS, that is safer.

Thanks, I'll look into it, along with the other suggested tools. Perhaps they will solve all my confusion :)
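(If I go the pg_receivexlog route, my understanding is it would look roughly like this - the slot name, user name and directory are made up, and the WAL directory would be local disk on the NFS host rather than the NFS mount itself:)

-- on the primary (requires max_replication_slots > 0):
SELECT pg_create_physical_replication_slot('wal_archive_slot');

# on the NFS host, writing to its own local storage:
pg_receivexlog -h primary-db -U replicator -D /srv/wal_archive -S wal_archive_slot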


Cheers,

Jeff
