Thread: Minimizing Recovery Time (wal replication)
I have two hot-spare databases that use WAL archiving and continuous recovery mode. I want to minimize recovery time when we have to fail over to one of our hot spares. Right now, I'm seeing the following behavior, which makes a quick recovery seem problematic:

(1) The hot spare applies 70 to 75 WAL files (~1.1 GB) in a 2 to 3 minute period.

(2) The hot spare pauses for 15 to 20 minutes; during this period pdflush consumes 99% of I/O (per iotop). Dirty (from /proc/meminfo) spikes to ~760 MB, stays at that level for the first 10 minutes, and then slowly ticks down to 0 over the second 10 minutes.

(3) goto 1

My concern is that if the database has been in recovery mode for some time, even if it's caught up, going live sometime during (1) could mean a recovery time of upwards of 20 minutes. We've experienced delays during failover in the past (not 20 minutes, but long enough to make me second-guess what we are doing). I want to better understand what is going on so that I can determine what, if anything, I can do to minimize downtime when we fail over to one of our hot spares.

Here are my current settings:

postgres (v8.3.7):
shared_buffers = 2GB                 (15GB total RAM)
effective_cache_size = 12GB          (15GB total RAM)
checkpoint_segments = 10
checkpoint_completion_target = 0.7
(other checkpoint/bgwriter settings left at their default values)

sysctl:
kernel.shmmax = 2684354560
vm.dirty_background_ratio = 1
vm.dirty_ratio = 5

Thanks,
Bryan
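For context, this is roughly what the standby side of an 8.3-era warm-standby setup like the one described looks like. It is only a sketch: the data directory and paths are assumptions, and the restore script name stands in for the kind of waiting script Bryan posts later in this thread (contrib/pg_standby would also work).

#!/bin/bash
# Sketch only -- hypothetical paths; adjust for the real installation.
PGDATA=/var/lib/postgresql/8.3/main    # assumed standby data directory

# recovery.conf keeps the server in continuous recovery.  PostgreSQL calls
# restore_command for each segment it needs; a waiting restore script blocks
# until the segment arrives, and its non-zero exit (e.g. on a trigger file)
# is what ends recovery and brings the database up.
cat > "$PGDATA/recovery.conf" <<'EOF'
restore_command = '/usr/local/bin/dbrestore /srv/logs/%f %p'
EOF

pg_ctl -D "$PGDATA" start              # starts in recovery and stays there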
On Thu, 9 Apr 2009, Bryan Murphy wrote:

> (1) hot spare applies 70 to 75 wal files (~1.1g) in 2 to 3 min period

Yeah, if you ever let this many files queue up you're facing a long recovery time. You really need to get into a position where you're applying WAL files regularly enough that you don't ever fall this far behind.

> (2) hot spare pauses for 15 to 20 minutes, during this period pdflush
> consumes 99% IO (iotop). Dirty (from /proc/meminfo) spikes to ~760mb,
> remains at that level for the first 10 minutes, and then slowly ticks
> down to 0 for the second 10 minutes.

What does vmstat say about the bi/bo during this time period? It sounds like the volume of random I/O produced by recovery is just backing up as expected. Some quick math:

15GB RAM * 5% dirty_ratio = 750MB; there's where your measured 760MB bottleneck is coming from.

750MB / 10 minutes = 1.25MB/s; that's in the normal range for random writes with a single disk.

Therefore my bet is that "vmstat 1" will show bo~=1250 the whole time you're waiting there, with matching figures from the iostat to the database disk during that period.

Basically your options here are:

1) Decrease the maximum possible segment backlog so you can never get this far behind

2) Increase the rate at which random I/O can be flushed to disk by either
   a) Improving things with a [better] battery-backed controller disk cache
   b) Stripe across more disks

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
On Thu, Apr 9, 2009 at 6:38 PM, Greg Smith <gsmith@gregsmith.com> wrote:
> What does vmstat say about the bi/bo during this time period? It sounds
> like the volume of random I/O produced by recovery is just backing up as
> expected. Some quick math:

I'll have to capture this; unfortunately I won't be able to do that until tomorrow. The machine I was looking at has already been failed over and I'm currently creating a new snapshot. I won't have a new hot spare to replace it until the morning.

> 15GB RAM * 5% dirty_ratio = 750MB; there's where your measured 760MB
> bottleneck is coming from.

That was what I thought; good to have it confirmed by somebody else.

> 750MB / 10 minutes = 1.25MB/s; that's in the normal range for random writes
> with a single disk.

Yes, this is an interesting problem I'm having; more on it below...

> Therefore my bet is that "vmstat 1" will show bo~=1250 the whole time you're
> waiting there, with matching figures from the iostat to the database disk
> during that period.
>
> Basically your options here are:
>
> 1) Decrease the maximum possible segment backlog so you can never get this
> far behind

I understand conceptually what you are saying, but I don't know how to practically realize this. :)  Do you mean lower checkpoint_segments?

> 2) Increase the rate at which random I/O can be flushed to disk by either
>    a) Improving things with a [better] battery-backed controller disk cache
>    b) Stripe across more disks

This is the problem that has been my nightmare for the past few months. It actually is an 8-drive RAID 10, BUT it's on virtualized infrastructure up in Amazon's cloud, running on 8 EBS volumes. I've found performance to be... inconsistent at best. Sometimes it's great, sometimes it's not so great.

We have a legacy database (~120 GB) which grew in our old data center on very powerful hardware. We moved it up to Amazon's cloud a few months ago and have been scrambling ever since. I wouldn't change what we're doing; the benefits so far have outweighed the pain, and we're actively working on the software to make better use of the cloud infrastructure (i.e. many smaller databases instead of one big database, lots of caching, the usual stuff). Unfortunately, that takes time, and I'm trying to limp along as best I can with the legacy database until we can get everything migrated.

So, to recap: I've RAIDed up the volumes, thrown as much RAM and CPU at the process as is available, and just can't seem to tease any more performance out.

Thanks,
Bryan
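A minimal sketch of the kind of capture Greg is asking for, to be run on the standby during the pause; the output paths are assumptions, and iostat here logs all devices rather than guessing the name of the database volume.

#!/bin/bash
# Sketch: log vmstat, extended iostat, and the kernel's Dirty counter while
# the standby is replaying WAL, so the pause can be correlated with write-back.
OUT=/tmp/standby-io              # assumed output prefix

vmstat 1    > "$OUT.vmstat" &    # watch the bi/bo columns Greg mentions
iostat -x 1 > "$OUT.iostat" &    # per-device write rates and utilisation

# Sample the dirty-page counter every 5 seconds until interrupted.
while true; do
    echo "$(date '+%H:%M:%S') $(grep '^Dirty:' /proc/meminfo)" >> "$OUT.dirty"
    sleep 5
done

If Greg's guess is right, the iostat output for the database device should show it busy nearly the whole time at only around 1-1.5 MB/s of writes during the pause.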
On Thu, 9 Apr 2009, Bryan Murphy wrote:

>> 1) Decrease the maximum possible segment backlog so you can never get this
>> far behind
>
> I understand conceptually what you are saying, but I don't know how to
> practically realize this. :)  Do you mean lower checkpoint_segments?

Theoretically, every time the archive_command writes a new segment out you can immediately move that to your standby, and set up the standby to regularly look for those and apply them as they come in. The fact that you're getting so many of them queued up suggests there's something in that path that isn't moving that pipeline along aggressively enough; without knowing more about what you're doing it's hard to say where that is.

> It actually is an 8 drive raid 10, BUT, it's on virtualized
> infrastructure up in Amazon's cloud running on 8 EBS volumes. I've
> found performance to be... inconsistent at best.

Yeah, EBS is not exactly a high-performance or predictable database storage solution, particularly when you get to where you're calling fsync a lot--which is exactly what is happening during the period you note your system is frozen.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
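One rough way to see where in that pipeline the segments are piling up is to compare the newest segment on the primary, the newest segment that has been shipped, and the last segment the standby actually restored. A sketch, with hypothetical hostnames, directories, and server log location:

# Newest WAL segment the primary is currently writing:
ssh primary 'ls /var/lib/postgresql/8.3/main/pg_xlog | grep -E "^[0-9A-F]{24}$" | sort | tail -1'

# Newest segment that has been shipped to the standby:
ls /srv/logs/ | grep -E '^[0-9A-F]{24}' | sort | tail -1

# Last segment the standby actually restored, taken from its server log
# ("restored log file ... from archive" is logged for each segment):
grep 'restored log file' /var/log/postgresql/postgresql-8.3-main.log | tail -1

If the shipped files keep pace with the primary but the restored segment lags far behind, the bottleneck is replay and flushing on the standby (which is what the pdflush behaviour above points to); if the shipped files themselves lag, the archive/copy path is the problem.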
On Thu, Apr 9, 2009 at 7:33 PM, Greg Smith <gsmith@gregsmith.com> wrote:
>>> 1) Decrease the maximum possible segment backlog so you can never get
>>> this far behind
>>
>> I understand conceptually what you are saying, but I don't know how to
>> practically realize this. :)  Do you mean lower checkpoint_segments?
>
> Theoretically, every time the archive_command writes a new segment out you
> can immediately move that to your standby, and set up the standby to
> regularly look for those and apply them as they come in. The fact that
> you're getting so many of them queued up suggests there's something in that
> path that isn't moving that pipeline along aggressively enough; without
> knowing more about what you're doing it's hard to say where that is.

This is our archiving command:

#!/bin/bash
# $1 = WAL segment to archive (%p), $2 = its file name (%f)
echo "archiving $2.bz2"
# Compress locally, then copy to the standby under a temporary name and
# rename it, so the standby never picks up a partially transferred file.
bzip2 -k -9 -c "$1" > "/srv/pg_logs/archive/$2.bz2.tmp" || exit $?
mv "/srv/pg_logs/archive/$2.bz2.tmp" "/srv/pg_logs/archive/$2.bz2" || exit $?
scp "/srv/pg_logs/archive/$2.bz2" "w.x.y.z:/srv/logs/$2.bz2.tmp" || exit $?
ssh w.x.y.z "mv /srv/logs/$2.bz2.tmp /srv/logs/$2.bz2" || exit $?
rm "/srv/pg_logs/archive/$2.bz2" || exit $?

And this is our restore command:

#!/bin/bash
# $1 = archived segment to look for (without .bz2), $2 = where PostgreSQL
# wants the restored segment written (%p)
if [ "$1" == "" ] || [ "$2" == "" ]; then
    echo "dbrestore [source] [destination]"
    exit 1
fi

echo "`date`: restoring $1"

while true
do
    # Segment has arrived: decompress it and hand it to the server.
    if [ -f "$1.bz2" ]; then
        echo "`date`: restore $1.bz2 -> $2"
        bunzip2 -d -c "$1.bz2" > "$2.tmp"
        mv "$2.tmp" "$2"
        exit 0
    fi

    # Don't block waiting for timeline history files, which usually don't exist.
    if [[ "$1" =~ ".history" ]]; then
        echo "`date`: skipping $1"
        exit 1
    fi

    # Touching /tmp/golive ends recovery: the non-zero exit tells
    # PostgreSQL there is no more WAL, so the standby comes up live.
    if [ -f "/tmp/golive" ]; then
        echo "`date`: going live"
        rm -f "/tmp/golive"
        exit 2
    fi

    sleep 5s
done

Essentially, what we do is bzip2 the file, scp it to the backup server, and then rename it over ssh. The bzip2 is legacy from when we were uploading to Amazon via the public internet and can go away now. The rename can happen in the restore script instead, which is something I should probably change anyway; one less thing for the master database to do.

We create file system snapshots of the hot spares, and I periodically purge the old log files after I've verified that we can bring the most recent snapshot live.

We've used NFS in the past, but we're currently investigating other distribution alternatives (primarily londiste and pgpool2). We've used Slony in the past, but find it introduces too much administrative overhead and is too brittle for our tastes.

Thanks again!
Bryan
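Following Bryan's own notes above, here is a sketch of what the archiver could shrink to once the compression and the separate ssh rename are dropped. The host and paths are the placeholders from his scripts, and rsync is only one way to do it: it copies to a temporary name on the receiving side and renames the file into place when the transfer completes, preserving the never-see-a-partial-file property in a single step.

#!/bin/bash
# Hypothetical simplified archive script, e.g. called from postgresql.conf as
#   archive_command = '/usr/local/bin/dbarchive %p %f'
# $1 = path to the WAL segment (%p), $2 = segment file name (%f)

echo "archiving $2"
# rsync writes to a dot-temporary file on the standby and renames it into
# place once the copy finishes, so no separate ssh rename is needed.
rsync -a "$1" "w.x.y.z:/srv/logs/$2" || exit $?

The restore script would then look for "$1" instead of "$1.bz2" and copy it straight to "$2" rather than running bunzip2.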
On Thu, 2009-04-09 at 14:27 -0500, Bryan Murphy wrote:
> I have two hot-spare databases that use wal archiving and continuous
> recovery mode. I want to minimize recovery time when we have to fail
> over to one of our hot spares. Right now, I'm seeing the following
> behavior which makes a quick recovery seem problematic:
>
> (1) hot spare applies 70 to 75 wal files (~1.1g) in 2 to 3 min period
>
> (2) hot spare pauses for 15 to 20 minutes, during this period pdflush
> consumes 99% IO (iotop). Dirty (from /proc/meminfo) spikes to ~760mb,
> remains at that level for the first 10 minutes, and then slowly ticks
> down to 0 for the second 10 minutes.
>
> (3) goto 1

The database is performing too frequent restartpoints.

This has been optimised in PostgreSQL 8.4 by the addition of the bgwriter running during recovery. This will mean that your hot spare will not pause while waiting for restartpoint to complete, so will be significantly faster.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Sun, Apr 12, 2009 at 5:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> The database is performing too frequent restartpoints.
>
> This has been optimised in PostgreSQL 8.4 by the addition of the
> bgwriter running during recovery. This will mean that your hot spare
> will not pause while waiting for restartpoint to complete, so will be
> significantly faster.

8.4 is already looking like it's going to be a great upgrade for us; this would be another win.

Thanks,
Bryan
If you're using full_page_writes = off, prefetching data pages referenced by the WAL speeds up recovery too. I proposed this patch, but it was carried over to 8.5. I'm planning to release it as an external tool, which will not work well with Hot Standby but will work in normal recovery. I'm not sure how soon I can do it, but I hope it works.

2009/4/13 Bryan Murphy <bmurphy1976@gmail.com>:
> On Sun, Apr 12, 2009 at 5:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> The database is performing too frequent restartpoints.
>>
>> This has been optimised in PostgreSQL 8.4 by the addition of the
>> bgwriter running during recovery. This will mean that your hot spare
>> will not pause while waiting for restartpoint to complete, so will be
>> significantly faster.
>
> 8.4 is already looking like it's going to be a great upgrade for us;
> this would be another win.
>
> Thanks,
> Bryan

--
------
Koichi Suzuki