Thread: Straightforward Synchronous Replication
Following design offers simplicity of design, performance and user
control over sync rep waits, including wait-for-apply for HS. It
implements both Oracle's Maximum Availability and Maximum Performance
options together, rather than just one or the other: async and sync
replication side by side, under user control.

* BACKEND: In xact.c, immediately after fsync during commit logic:

     if (sync_rep != NONE)
     {
         max_wakeup_time = commit_timestamp + sync_rep_timeout;
         SetAlarm(max_wakeup_time);  /* similar to statement timeout */
         WaitOnQueue(commitLSN);
         DisableAlarm();
     }

  In proc.c, in the signal handler code:

     if (wakeup && waiting_on_commit)
         RemoveFromQueue();

* New process: WALSync (on primary)
  Receives messages from WALAck on standby and wakes up queued backends
  that have reached the requested commitLSN. If there are multiple
  WALSync processes, they all try to remove backends from the head of
  the queue. The process is started in the same way as WALSender, when
  a request arrives from a standby. (WaitOnQueue() returns immediately
  if no WALSync has been started, since that means sync rep is not yet
  available.)

* New process: WALAck (on standby)
  Reads shared memory to get last received and last applied xlog location
  and sends message to WALSync on primary. Loop/Sleep forever. The
  values in shared memory are already put there by the WALReceiver and
  Startup processes. Reuses the same message protocol as
  WALSender->WALReceiver. The process is started after WALReceiver
  connects, if an additional option is set in recovery.conf: it
  initiates a second connection to the primary and issues a slightly
  different startup command to create a WALSync process.

That's it. (Rough sketches of these pieces follow at the end of this
mail.)

The above needs just two parameters at user level

     synch_rep = none | recv | apply
     synch_rep_timeout = Ns

and an additional parameter in recovery.conf to say whether a standby is
providing the facility for sync replication (as requested by Yeb etc)
(default = yes).

So this is the same as having quorum = 0 or 1 (boring but simple) and
having sync_rep_timeout_action = commit in all cases (clear behaviour in
failure modes, without need for per-standby parameters). The user
specifies how long they wish to wait, but that wait never changes the
flow of WAL data through the cluster, so we don't need to retune and
redesign the existing system for reduced latency. It allows mixed
synchronous and asynchronous replication with *ease*. If we design
things differently, that wouldn't be the case.

The design is:
* simple - doesn't require any WAL or libpq changes
* modular - almost completely isolated from existing components in 9.0
  (e.g. WALSender doesn't know or care about WALSync; WALReceiver never
  needs to speak to WALAck directly)
* performant - async and sync can co-exist; WALReceiver never waits; no
  need to retune WALSender operation for synchronous mode
* low latency - the backchannel from standby to primary uses a separate
  connection, so it can operate without slowing down the data flowing
  from the primary
* user centric - allows user control over this feature, an important
  tool for real-world performance
* hot standby - implements the xid backchannel with ease (later phase)

We can hang other options on this later - nothing else is essential.
Development time ~1 man-month, because similar code exists for all
aspects described above, so no research or internals discussion is
required.

Yes, this is a 3rd design for sync rep, though I think it improves upon
the things I've heard so far from other authors and also includes
feedback from Dimitri, Heikki, Yeb, Alastair. I'm happy to code this as
well, when 9.1 dev starts, and a benchmark should be interesting also.
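To make the queue mechanics concrete, here is a rough, self-contained
sketch of WaitOnQueue() and the WALSync-side release. Only the names
WaitOnQueue()/RemoveFromQueue() come from the design above; the struct
layout, ReleaseWaiters() and the pthread primitives (standing in for
PostgreSQL spinlocks and semaphores) are illustrative assumptions, not
settled code:

    /* Illustrative sketch only: layout and helper names are hypothetical. */
    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>
    #include <pthread.h>

    typedef uint64_t XLogRecPtr;        /* assume a flat 64-bit LSN */

    #define MAX_WAITERS 1024

    typedef struct Waiter
    {
        XLogRecPtr      waitLSN;        /* commit LSN being waited for */
        pthread_cond_t  wakeup;         /* assumed initialised at startup */
        bool            released;
    } Waiter;

    typedef struct SyncRepQueue
    {
        pthread_mutex_t lock;           /* stand-in for a PG spinlock */
        bool            walsync_running; /* any WALSync process attached? */
        int             nwaiters;
        Waiter         *waiters[MAX_WAITERS]; /* kept sorted by waitLSN */
    } SyncRepQueue;

    /* Backend side: called immediately after the local commit fsync. */
    void
    WaitOnQueue(SyncRepQueue *q, Waiter *me, XLogRecPtr commitLSN)
    {
        pthread_mutex_lock(&q->lock);
        if (!q->walsync_running || q->nwaiters >= MAX_WAITERS)
        {
            /* no sync rep available: return without waiting */
            pthread_mutex_unlock(&q->lock);
            return;
        }

        /* insert in LSN order, so WALSync only ever inspects the head */
        int pos = q->nwaiters++;
        while (pos > 0 && q->waiters[pos - 1]->waitLSN > commitLSN)
        {
            q->waiters[pos] = q->waiters[pos - 1];
            pos--;
        }
        me->waitLSN = commitLSN;
        me->released = false;
        q->waiters[pos] = me;

        /* sleep until WALSync (or the timeout alarm, via the
         * RemoveFromQueue() path) releases us */
        while (!me->released)
            pthread_cond_wait(&me->wakeup, &q->lock);
        pthread_mutex_unlock(&q->lock);
    }

    /* WALSync side: the standby has acknowledged everything up to
     * ackLSN, so release every waiter at or below that point. */
    void
    ReleaseWaiters(SyncRepQueue *q, XLogRecPtr ackLSN)
    {
        pthread_mutex_lock(&q->lock);
        while (q->nwaiters > 0 && q->waiters[0]->waitLSN <= ackLSN)
        {
            Waiter *w = q->waiters[0];
            memmove(&q->waiters[0], &q->waiters[1],
                    (size_t) (q->nwaiters - 1) * sizeof(Waiter *));
            q->nwaiters--;
            w->released = true;
            pthread_cond_signal(&w->wakeup);
        }
        pthread_mutex_unlock(&q->lock);
    }

Because the queue is kept ordered by LSN, releasing in batches is just
popping the head repeatedly, which is why multiple WALSync processes
can all safely compete for the head of the queue.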
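The WALAck main loop is even simpler. This sketch assumes shared-memory
locations maintained by WALReceiver and Startup, plus a hypothetical
send_ack_message() wrapper that reuses the WALSender->WALReceiver
framing over WALAck's own connection; none of these names are real:

    /* Illustrative sketch only: the extern names are hypothetical
     * stand-ins for the real shared-memory fields and send path. */
    #include <stdint.h>
    #include <unistd.h>

    typedef uint64_t XLogRecPtr;

    extern volatile XLogRecPtr *sharedReceivedUpto; /* set by WALReceiver */
    extern volatile XLogRecPtr *sharedAppliedUpto;  /* set by Startup */

    extern void send_ack_message(XLogRecPtr received, XLogRecPtr applied);

    void
    WALAckMain(void)
    {
        XLogRecPtr lastReceived = 0;
        XLogRecPtr lastApplied = 0;

        for (;;)                        /* loop/sleep forever */
        {
            XLogRecPtr received = *sharedReceivedUpto;
            XLogRecPtr applied  = *sharedAppliedUpto;

            /* only message the primary when something has moved */
            if (received != lastReceived || applied != lastApplied)
            {
                send_ack_message(received, applied);
                lastReceived = received;
                lastApplied  = applied;
            }

            usleep(10 * 1000);          /* poll interval bounds added latency */
        }
    }

Note that the poll interval is the whole latency cost of this design:
nothing in the WAL path ever waits for WALAck.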
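For completeness, the whole user-visible surface might look like this
(the recovery.conf option isn't named above, so synchronous_standby is
just a placeholder):

    # postgresql.conf on the primary
    synch_rep = recv            # none | recv | apply
    synch_rep_timeout = 2s      # stop waiting and commit after 2s

    # recovery.conf on each standby (placeholder option name)
    synchronous_standby = yes   # offer the sync rep facility (default = yes)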
--
Simon Riggs
www.2ndQuadrant.com
On Thu, May 27, 2010 at 9:08 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> * New process: WALAck (on standby)
> Reads shared memory to get last received and last applied xlog location
> and sends message to WALSync on primary. Loop/Sleep forever.

So would WALAck be polling shared memory? That would increase latency
significantly, I think, though perhaps you have a plan for avoiding
that?

> The above needs just two parameters at user level
> synch_rep = none | recv | apply
> synch_rep_timeout = Ns
> and an additional parameter in recovery.conf to say whether a standby is
> providing the facility for sync replication (as requested by Yeb etc)
> (default = yes).
>
> So this is the same as having quorum = 0 or 1 (boring but simple) and
> having sync_rep_timeout_action = commit in all cases (clear behaviour in
> failure modes, without need for per-standby parameters).

This seems good, but I think we need a little more definition about
what happens when sync_rep_timeout expires.

> Yes, this is a 3rd design for sync rep, though I think it improves upon
> the things I've heard so far from other authors and also includes
> feedback from Dimitri, Heikki, Yeb, Alastair. I'm happy to code this as
> well, when 9.1 dev starts, and a benchmark should be interesting also.

It's great that we have so many people who want to implement this
feature, or in one case already have. I'm not sure whose design is
best, but I do hope that we can avoid dueling patches. There are
plenty of other good features to work on also.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Thu, 2010-05-27 at 10:11 -0400, Robert Haas wrote:
> On Thu, May 27, 2010 at 9:08 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > * New process: WALAck (on standby)
> > Reads shared memory to get last received and last applied xlog location
> > and sends message to WALSync on primary. Loop/Sleep forever.
>
> So would WALAck be polling shared memory? That would increase latency
> significantly, I think, though perhaps you have a plan for avoiding
> that?

The backends are going to be released in batches anyway, so I can't see
how polling makes a difference.

Polling means no waiting, so asynchronous action and higher throughput,
and with sufficiently high polling rate no significant loss of latency.

The other plan requires WALReceiver to wait for fsync and apply, which
seems very likely to suck badly from a latency perspective. While it's
waiting, it is also reducing throughput of incoming WAL. It's hard to
see how that would work well.

You could also do this by avoiding the wait in WALReceiver, but then
that becomes more like polling anyway.

> > The above needs just two parameters at user level
> > synch_rep = none | recv | apply
> > synch_rep_timeout = Ns
> > and an additional parameter in recovery.conf to say whether a standby is
> > providing the facility for sync replication (as requested by Yeb etc)
> > (default = yes).
> >
> > So this is the same as having quorum = 0 or 1 (boring but simple) and
> > having sync_rep_timeout_action = commit in all cases (clear behaviour in
> > failure modes, without need for per-standby parameters).
>
> This seems good, but I think we need a little more definition about
> what happens when sync_rep_timeout expires.

It commits... that is very clear: "sync_rep_timeout_action = commit in
all cases". Commit is the only viable option, since abort and
wait-forever both have disadvantages pointed out for them.

> > Yes, this is a 3rd design for sync rep, though I think it improves upon
> > the things I've heard so far from other authors and also includes
> > feedback from Dimitri, Heikki, Yeb, Alastair. I'm happy to code this as
> > well, when 9.1 dev starts, and a benchmark should be interesting also.
>
> It's great that we have so many people who want to implement this
> feature, or in one case already have. I'm not sure whose design is
> best, but I do hope that we can avoid dueling patches. There are
> plenty of other good features to work on also.

There is already a patch on SR, yet Masao is discussing another that
contains what looks to me like very close to nothing of Zoltan's work,
not even similar ideas. The dueling patches situation looks like it
already exists to me, though not of my making or encouragement. Even if
I agreed with everything one of those authors says, there would still
be two patches.

Considering a variety of design approaches seems like a good idea for
an important feature, especially when the information is thin and
opinions run high. It's unlikely that anyone is right about everything,
which is why I've amalgamated this simple proposal from everything said
so far.

It's easy to add some things if we add them at the start, much harder
to retrofit them. I've shown that some things are easier than has been
said, with fewer parameters and a good case for better performance
also.

--
Simon Riggs
www.2ndQuadrant.com
On Thu, May 27, 2010 at 11:50 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Thu, 2010-05-27 at 10:11 -0400, Robert Haas wrote:
>> On Thu, May 27, 2010 at 9:08 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> > * New process: WALAck (on standby)
>> > Reads shared memory to get last received and last applied xlog location
>> > and sends message to WALSync on primary. Loop/Sleep forever.
>>
>> So would WALAck be polling shared memory? That would increase latency
>> significantly, I think, though perhaps you have a plan for avoiding
>> that?
>
> The backends are going to be released in batches anyway, so I can't see
> how polling makes a difference.
>
> Polling means no waiting, so asynchronous action and higher throughput,
> and with sufficiently high polling rate no significant loss of latency.

I guess what I'm trying to figure out is the part that says "loop/sleep
forever". That sounds like you wait 50 ms (or some other interval), then
check shared memory to see if anything has changed, and if not you do it
again. That means that up to 49.9 ms (or whatever interval you picked)
could be spent waiting before you realize that new WAL has been applied,
which I suspect will not work out very well. On the other hand, checking
in a tight loop would mean using up a whole CPU on an idle system, so
that's not practical either. ISTM you'd need some kind of signalling
system between the startup process and the WALAck process, so that the
startup process can wake WALAck after applying each bit of WAL (or maybe
the startup process knows about the lowest LSN that WALAck cares about,
and wakes it only upon reaching that point).

> The other plan requires WALReceiver to wait for fsync and apply, which
> seems very likely to suck badly from a latency perspective. While it's
> waiting, it is also reducing throughput of incoming WAL. It's hard to
> see how that would work well.
>
> You could also do this by avoiding the wait in WALReceiver, but then
> that becomes more like polling anyway.

I'm not sure if I understand this part, so let me try to say it another
way and you can tell me if I've got it right. I think your concern is
that, during the time that WALReceiver is waiting for one chunk of WAL
to get fsynced, the startup process might finish applying an earlier
chunk of WAL that is of interest to the master. The ACK will therefore
be delayed until the fsync completes and WALReceiver can again do other
things, like check whether there are any ACKs that must be sent. Is
that it, or have I missed the boat completely?

>> > The above needs just two parameters at user level
>> > synch_rep = none | recv | apply
>> > synch_rep_timeout = Ns
>> > and an additional parameter in recovery.conf to say whether a standby is
>> > providing the facility for sync replication (as requested by Yeb etc)
>> > (default = yes).
>> >
>> > So this is the same as having quorum = 0 or 1 (boring but simple) and
>> > having sync_rep_timeout_action = commit in all cases (clear behaviour in
>> > failure modes, without need for per-standby parameters).
>>
>> This seems good, but I think we need a little more definition about
>> what happens when sync_rep_timeout expires.
>
> It commits... that is very clear: "sync_rep_timeout_action = commit in
> all cases". Commit is the only viable option, since abort and
> wait-forever both have disadvantages pointed out for them.

So, do we declare the sync server offline at that point and stop waiting
for it, or do we continue waiting for it on every transaction? If we
declare it dead, what are the criteria for subsequently making it alive
again?

>> > Yes, this is a 3rd design for sync rep, though I think it improves upon
>> > the things I've heard so far from other authors and also includes
>> > feedback from Dimitri, Heikki, Yeb, Alastair. I'm happy to code this as
>> > well, when 9.1 dev starts, and a benchmark should be interesting also.
>>
>> It's great that we have so many people who want to implement this
>> feature, or in one case already have. I'm not sure whose design is
>> best, but I do hope that we can avoid dueling patches. There are
>> plenty of other good features to work on also.
>
> There is already a patch on SR, yet Masao is discussing another that
> contains what looks to me like very close to nothing of Zoltan's work,
> not even similar ideas. The dueling patches situation looks like it
> already exists to me, though not of my making or encouragement. Even if
> I agreed with everything one of those authors says, there would still
> be two patches.

Oh, I wasn't aware that Fujii Masao's work had progressed as far as an
actual patch yet.

> Considering a variety of design approaches seems like a good idea for
> an important feature, especially when the information is thin and
> opinions run high. It's unlikely that anyone is right about everything,
> which is why I've amalgamated this simple proposal from everything said
> so far.

Agreed.

> It's easy to add some things if we add them at the start, much harder
> to retrofit them. I've shown that some things are easier than has been
> said, with fewer parameters and a good case for better performance
> also.

I am personally not sure who has the best design at this point in time,
but I am glad that we are moving in the direction of simplifying.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company