Thread: Sync Replication with transaction-controlled durability
I'm working on a patch to implement synchronous replication for PostgreSQL, with user-controlled durability specified on the master. The design also provides high throughput by allowing concurrent processes to handle the WAL stream. The proposal requires only 3 new parameters and takes into account much community feedback on earlier ideas. The patch is fairly simple, though reworking it to sit on top of latches means I haven't quite finished it yet. I was requested to explain the design as soon as possible, so I am posting this ahead of the patch.

= WHAT'S DIFFERENT ABOUT THIS PATCH?

* Low complexity of code on the standby.
* User control: all decisions to wait take place on the master, allowing fine-grained control of synchronous replication. The maximum replication level can also be set on the standby.
* Low bandwidth: a very small response packet size, with no increase in the number of responses when the system is under high load, means very little additional bandwidth is required.
* Performance: standby processes work concurrently to give good overall throughput on the standby and minimal latency in all modes. The 4 performance options don't interfere with each other, so they offer different levels of performance/durability alongside each other.

These are major wins for the PostgreSQL project over and above the basic sync rep feature.

= SYNCHRONOUS REPLICATION OVERVIEW

Synchronous replication offers the guarantee that all changes made by a transaction have been transferred to remote standby nodes. This is an extension to the standard level of durability offered by a transaction commit.

When synchronous replication is requested, the transaction will wait after it commits until it receives confirmation that the transfer has been successful. Waiting for confirmation increases the user's certainty that the transfer has taken place, but it also necessarily increases the response time for the requesting transaction. Synchronous replication usually requires carefully planned and placed standby servers to ensure applications perform acceptably. Waiting doesn't utilise system resources, but transaction locks continue to be held until the transfer is confirmed. As a result, incautious use of synchronous replication will lead to reduced performance for database applications.

It may seem that there is a simple choice between durability and performance. However, there is often a close relationship between the importance of data and how busy the database needs to be, so this is seldom a simple choice. With this patch, PostgreSQL provides a range of features designed to allow application architects to design a system that has both good overall performance and good durability for its most important data assets.

PostgreSQL allows the application designer to specify the durability level required via replication. This can be specified for the system overall, though it can also be specified for individual transactions. This allows the application to selectively provide the highest levels of protection for critical data.

For example, an application might consist of two types of work:
* 10% of changes are changes to important customer details
* 90% of changes are less important data that the business can more easily survive losing, such as chat messages between users.

With sync replication options specified at the application level (on the master) we can offer sync rep for the most important changes, without slowing down the bulk of the total workload.
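As a sketch of how that split might look from the application, using the synchronous_replication parameter described in the USER'S OVERVIEW below (table names and values are purely illustrative):

    -- Bulk, low-importance work keeps the default; the commit does not wait
    BEGIN;
    INSERT INTO chat_messages (sender, recipient, body)
        VALUES ('alice', 'bob', 'hi');
    COMMIT;

    -- A critical change requests synchronous replication for this transaction only
    BEGIN;
    SET LOCAL synchronous_replication = 'apply';
    UPDATE customers SET credit_limit = 5000 WHERE customer_id = 42;
    COMMIT;  -- waits here until a standby confirms apply, or the timeout expires

SET LOCAL reverts the setting at the end of the transaction, so only the commits that need the extra durability pay for the wait.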
Application level options are an important and practical tool for allowing the benefits of synchronous replication for high performance applications. Without sync rep options specified at the application level, we would have a choice of either slowing down 90% of the workload because 10% of it is important, or giving up our durability goals because of performance, or splitting those two functions onto separate database servers so that we can set options differently on each. None of those 3 options is truly attractive.

PostgreSQL also gives the system administrator the ability to specify the service levels offered by standby servers. This allows multiple standby servers to work together in various roles within a server farm.

Control of this feature relies on just 3 parameters:

On the master we can set
* synchronous_replication
* synchronous_replication_timeout

On the standby we can set
* synchronous_replication_service

These are explained in more detail in the following sections.

= USER'S OVERVIEW

Two new USERSET parameters on the master control this:
* synchronous_replication = async (default) | recv | fsync | apply
* synchronous_replication_timeout = 0+ (0 means never timeout) (default timeout 10sec)

synchronous_replication = async is the default and means that no synchronisation is requested, so the commit will not wait. This is the fastest setting. The word async is short for "asynchronous" and you may see the term asynchronous replication discussed.

Other settings refer to progressively higher levels of durability. The higher the level of durability requested, the longer the wait for that level of durability to be achieved. The precise meaning of the synchronous_replication settings is

* async - commit does not wait for a standby before replying to the user
* recv  - commit waits until the standby has received the WAL
* fsync - commit waits until the standby has received and fsynced the WAL
* apply - commit waits until the standby has received, fsynced and applied the WAL

This provides a simple, easily understood mechanism - and one that in its default form is very similar to other RDBMSs (e.g. Oracle). Note that in apply mode it is possible that the changes could be accessible on the standby before the transaction that made the change has been notified that the change is complete. Minor issue.

Network delays may occur and the standby may also crash. If no reply is received within the timeout we raise a NOTICE and then return successful commit (no other action is possible). Note that it is possible to request that we never timeout, so if no standby is available we wait for one to appear.

When a user commits, if the master does not have a currently connected standby offering the required level of replication it will pick the next best available level of replication. It is up to the sysadmin to provide a sufficient range of standby nodes to ensure at least one is available to meet the requested service levels.

If multiple standbys exist, the first standby to reply that the desired level of durability has been achieved will release the waiting commit on the master. Other options are also available via a plugin.
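Since these are USERSET parameters, the 10%/90% split described earlier could also be configured once per application role rather than in every transaction. A sketch, assuming the proposed parameter names; the role names are purely illustrative and the timeout units (seconds) are an assumption:

    -- Customer-detail changes wait until a standby has fsynced the WAL
    ALTER ROLE billing_app SET synchronous_replication = 'fsync';
    ALTER ROLE billing_app SET synchronous_replication_timeout = 10;  -- assumed seconds

    -- Bulk chat traffic keeps the default and never waits
    ALTER ROLE chat_app SET synchronous_replication = 'async';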
== ADMINISTRATOR'S OVERVIEW

On the standby we specify the highest type of replication service offered by this standby server. This information is passed to the master server when the standby connects for replication. This allows sysadmins to designate preferred standbys. It also allows sysadmins to refuse to offer a synchronous replication service at all, allowing a master to explicitly avoid synchronisation across low bandwidth or high latency links.

An additional parameter can be set in recovery.conf on the standby
* synchronous_replication_service = async (default) | recv | fsync | apply

= IMPLEMENTATION

Some aspects can be changed without significantly altering the basic proposal; for example, master-specified standby registration wouldn't really alter this very much.

== STANDBY

Master-controlled sync rep means that all user wait logic is centred on the master. The details of sync rep requests on the master are not sent to the standby, so there is no additional master-to-standby traffic and no standby-side bookkeeping overhead. It also reduces the complexity of the standby code.

On the standby side the WAL Writer now operates during recovery. This frees the WALReceiver to spend more time sending and receiving messages, thereby minimising latency for users choosing the "recv" option. We now have 3 processes handling WAL in an asynchronous pipeline: the WAL Receiver reads WAL data from the libpq connection and writes it to the WAL file, the WAL Writer then fsyncs the WAL file, and the Startup process replays the WAL. These processes act independently, so the WAL pointers (LSNs) are ordered as

    WALReceiverLSN >= WALWriterLSN >= StartupLSN

For each new message the WALReceiver gets from the master we issue a reply. Each reply sends the current state of the 3 LSNs, so the reply message size is only 28 bytes. Replies are sent half-duplex, i.e. we don't reply while a new message is arriving.

Note that there is absolutely not one reply per transaction on the master. The standby knows nothing about what has been requested on the master - replies always refer to the latest standby state and effectively batch the responses.

We act according to the requested synchronous_replication_service:

* async - no replies are sent
* recv  - replies are sent upon receipt only
* fsync - replies are sent upon receipt and following fsync only
* apply - replies are sent following receipt, fsync and apply

Replies are sent at the next available opportunity. In apply mode, when the WALReceiver is completely quiet this means we send 3 reply messages - one at recv, one at fsync and one at apply. When the WALreceiver is busy the volume of messages does *not* increase, since the reply can't be sent until the current incoming message has been received, after which we were going to reply anyway, so it is not an additional message. This means we piggyback an "apply" response onto a later "recv" reply. As a result we get minimum response times in *all* modes and maximum throughput is not impaired at all.

When each new message arrives from the master the WALreceiver will write the new data to the WAL file, wake the WALwriter and then reply. Each new message from the master receives a reply. If no further WAL data has been received the WALreceiver waits on the latch. If the WALReceiver is woken by the WALWriter or Startup then it will reply to the master with a message, even if no new WAL has been received. So in all of the recv, fsync and apply cases a message is sent to the master as soon as possible, so in all cases the wait time is minimised.

When the WALwriter is woken it sees if there is outstanding WAL data and if so fsyncs it and wakes both the WALreceiver and Startup. When no WAL remains it waits on the latch.

The Startup process will wake the WALreceiver when it has got to the end of the latest chunk of WAL. If no further WAL is available then it waits on its latch.
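For reference, the receive and replay ends of this pipeline can already be observed from SQL on a 9.0 standby with existing functions; the intermediate fsynced position (the proposed WALWriterLSN) has no existing SQL-level equivalent, so only two of the three LSNs appear in this sketch:

    -- Run on the standby: with the pipeline above, the received position
    -- should never be behind the replayed position.
    SELECT pg_last_xlog_receive_location() AS walreceiver_lsn,
           pg_last_xlog_replay_location()  AS startup_lsn;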
== MASTER

When user backends request sync rep they wait in a queue ordered by requested LSN. A separate queue exists for each request mode.

The WALSender receives the 3 LSNs from the standby. It then wakes backends in sequence from each queue. We provide a single wakeup rule: the first WALSender to reply with the requested XLogRecPtr will wake the backend. This guarantees that the WAL data for the commit has been transferred, as requested, to at least one standby. That is sufficient for the use cases we have discussed. More complex wakeup rules would be possible via a plugin.

The wait timeout would be set by individual backends with a timer, just as we do for statement_timeout.

= CODE

The total amount of code to implement this is low. It breaks down into 5 areas:

* Zoltan's libpq changes, included almost verbatim; fairly modular, so easy to replace with something we like better
* A new module, syncrep.c and syncrep.h, to handle the backend wait/wakeup
* Light changes to allow streaming rep to make the appropriate calls
* A small amount of code to allow the WALWriter to be active in recovery
* Parameter code

No docs yet.

The patch works on top of latches, though it does not rely upon them for its bulk performance characteristics. Latches only improve response time for very low transaction rates; they provide no additional throughput for medium to high transaction rates.

= PERFORMANCE ANALYSIS

Since we reply to each new chunk sent from the master, "recv" mode has absolutely minimal latency, especially since the WALreceiver no longer performs the majority of fsyncs, as in the 9.0 code. The WALreceiver does not wait for the fsync or apply actions to complete before we reply, so fsync and apply modes will always wait at least 2 standby->master messages, which is appropriate because those actions will typically occur much later. This response mechanism offers the highest responsiveness achievable in "recv" mode and very good throughput under load. Note that the different modes do not interfere with each other and can co-exist happily while providing the highest performance.

Starting the WALWriter is helpful no matter what synchronous_replication_service is specified.

Can we optimise the sending of reply messages so that only chunks that contain a commit deserve a reply? We could, but then we'd need to do extra bookkeeping on the master. It would need to be demonstrated that there is a performance issue big enough to be worth the overhead on the master and the extra code.

Is there an optimisation from reducing the number of options the standby provides? The architecture on the standby side doesn't rely heavily on the service level specified, nor does it rely in any way on the actual sync rep mode specified on the master. No further simplification is possible.

= NOT YET IMPLEMENTED

* Timeout code & NOTICE
* Code and test plugin
* Loops in walsender, walwriter and walreceiver treat shutdown incorrectly

I haven't yet looked at Fujii's code for this, not even sure where it is, though I hope to do so in the future. Zoltan's libpq code is the only part of that patch used.

So far I have spent 3.5 days on this and expect to complete tomorrow. I think that throws out the argument that this proposal is too complex to develop in this release.

= OTHER ISSUES

* How should the master behave when we shut it down?
* How should the standby behave when we shut it down?

I will post my patch on this thread when it is available.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
On Sep 14, 2010, at 10:48 AM, Simon Riggs wrote:

> I will post my patch on this thread when it is available.

Sounds awesome Simon, I look forward to seeing the discussion!

Best,

David
On 14 September 2010 21:36, David E. Wheeler <david@kineticode.com> wrote:
> On Sep 14, 2010, at 10:48 AM, Simon Riggs wrote:
>
>> I will post my patch on this thread when it is available.
>
> Sounds awesome Simon, I look forward to seeing the discussion!
>
> Best,
>
> David

Excellent! :) Amazingly, I actually understand how that works. Nice work Simon :)

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935
On 14/09/10 20:48, Simon Riggs wrote:
> When each new messages arrives from master the WALreceiver will write
> the new data to the WAL file, wake the WALwriter and then reply. Each
> new message from master receives a reply. If no further WAL data has
> been received the WALreceiver waits on the latch. If the WALReceiver is
> woken by WALWriter or Startup then it will reply to master with a
> message, even if no new WAL has been received.

Wrt. the earlier discussion about when the standby sends the acknowledgment, this is the key paragraph. So you *are* sending multiple acknowledgments per transaction, but there is some smarts to combine them when there's a lot of traffic. Fair enough.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Wed, 2010-09-15 at 11:54 +0300, Heikki Linnakangas wrote:
> On 14/09/10 20:48, Simon Riggs wrote:
> > When each new messages arrives from master the WALreceiver will write
> > the new data to the WAL file, wake the WALwriter and then reply. Each
> > new message from master receives a reply. If no further WAL data has
> > been received the WALreceiver waits on the latch. If the WALReceiver is
> > woken by WALWriter or Startup then it will reply to master with a
> > message, even if no new WAL has been received.
>
> Wrt. the earlier discussion about when the standby sends the
> acknowledgment, this is the key paragraph. So you *are* sending multiple
> acknowledgments per transaction, but there is some smarts to combine
> them when there's a lot of traffic. Fair enough.

Not really. It's a simple design that works on chunks of WAL data, not individual transactions. There is literally zero code executed to achieve that, nor is bandwidth expended passing additional information. "Smarts" tends to imply some complex optimization, whereas this is the best optimization of all: no code whatsoever "per transaction".

If no new WAL is received then we do two extra messages, that's all, but those replies only occur when the inbound path is otherwise quiet.

In the typical case of a busy system there is one reply per chunk of WAL. Since we already piggyback WAL writes into chunks in XLogWrite(), that means each reply acknowledges many transactions and there are zero additional messages. Fast, efficient, no extra code.

When there is only one commit in a chunk of WAL data *and* the standby is configured for 'apply' *and* nothing else occurs afterwards for some time (long enough for an fsync and an apply, so at least 10ms), there will be 3 replies for one transaction. That won't be the typical case, and even when it does happen it's not a problem because the server is otherwise quiet (by definition).

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
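(To put purely hypothetical numbers on the batching claim above, not measurements: if a busy master flushes WAL in chunks that each carry around 50 commits, then 500 commits/sec produce roughly 10 chunks/sec and therefore roughly 10 replies/sec - about 280 bytes/sec of acknowledgement traffic at 28 bytes per reply - whereas a per-transaction acknowledgement scheme would send around 500 replies/sec.)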
On Sep 15, 2010, at 5:23 AM, Simon Riggs wrote:

> Fast, efficient, no extra code.

I love that sentence. Even if it has no verb.

Best,

David
On Tue, 2010-09-14 at 18:48 +0100, Simon Riggs wrote:
> I'm working on a patch to implement synchronous replication for
> PostgreSQL, with user-controlled durability specified on the master. The
> design also provides high throughput by allowing concurrent processes to
> handle the WAL stream. The proposal requires only 3 new parameters and
> takes into account much community feedback on earlier ideas.

Text here added to the wiki

http://wiki.postgresql.org/wiki/Synchronous_replication

for better readability and tracking of changes.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
On Tue, 2010-09-14 at 18:48 +0100, Simon Riggs wrote:
> I'm working on a patch to implement synchronous replication for
> PostgreSQL, with user-controlled durability specified on the master. The
> design also provides high throughput by allowing concurrent processes to
> handle the WAL stream. The proposal requires only 3 new parameters and
> takes into account much community feedback on earlier ideas.

I'm now implementing v5, which simplifies the parameters still further.

USERSET on the master
* synchronous_replication = off (default) | on
* synchronous_replication_timeout >= 0 (default = 0, meaning wait forever)

Set in postgresql.conf on the standby
* synchronous_replication_service = on (default) | off

The WALwriter is not active, nor are multiple sync rep modes available. The coding allows us to extend the number of modes in future.

The coding also solves the problem raised by Dimitri: we don't advertise the sync rep service until the standby has caught up.

This patch is a rough WIP, mostly stripping out and streamlining. It doesn't work yet, but people say they like to see me working, so here 'tis.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
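For comparison with the earlier four-level interface, here is a minimal sketch of what a client transaction might look like under the simplified v5 parameters proposed in the message above; everything other than the parameter names is illustrative:

    BEGIN;
    SET LOCAL synchronous_replication = on;          -- boolean form of the parameter in v5
    SET LOCAL synchronous_replication_timeout = 0;   -- 0 = wait forever for the acknowledgement
    UPDATE customers SET credit_limit = 5000 WHERE customer_id = 42;
    COMMIT;  -- blocks until a caught-up synchronous standby acknowledges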
On Fri, Oct 8, 2010 at 11:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Tue, 2010-09-14 at 18:48 +0100, Simon Riggs wrote:
>
>> I'm working on a patch to implement synchronous replication for
>> PostgreSQL, with user-controlled durability specified on the master. The
>> design also provides high throughput by allowing concurrent processes to
>> handle the WAL stream. The proposal requires only 3 new parameters and
>> takes into account much community feedback on earlier ideas.
>
> I'm now implementing v5, which simplifies the parameters still further
>
> USERSET on master
> * synchronous_replication = off (default) | on
> * synchronous_replication_timeout >=0 default=0 means wait forever
>
> set in postgresql.conf on standby
> * synchronous_replication_service = on (default) | off
>
> WALwriter is not active, nor are multiple sync rep modes available.
> Coding allows us to extend number of modes in future.
>
> Coding also solves problem raised by Dimitri: we don't advertise the
> sync rep service until the standby has caught up.
>
> This patch is a rough WIP, mostly stripping out and streamlining. It
> doesn't work yet, but people say they like to see me working, so here
> 'tis.

It seems like it would be more helpful if you were working on implementing a design that had more than one vote. As far as I can tell, we have rough consensus that for the first commit we should only worry about the case where k = 1; that is, only one ACK is ever required for commit; and Greg Smith spelled out some more particulars for a minimum acceptable implementation in the second part of the email found here:

http://archives.postgresql.org/pgsql-hackers/2010-10/msg00384.php

That proposal is, AFAICT, the ONLY one that has got more than one vote, and certainly the only one that has got as many votes as that one does. If you want to implement that, then I think we could reach critical consensus on committing it very quickly. If you DON'T want to implement that proposal, then I suggest that we let Fujii Masao or Heikki implement and commit it. I realize, as you've pointed out before, that there is no danger of missing 9.1 at this point, but on the flip side I don't see that there's anything to be gained by spending another month rehashing the topic when there's a good proposal on the table that's got some momentum behind it. Let's not make this more complicated than it needs to be.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> It seems like it would be more helpful if you were working on
> implementing a design that had more than one vote. As far as I can
> tell, we have rough consensus that for the first commit we should only
> worry about the case where k = 1; that is, only one ACK is ever
> required for commit

My understanding by reading the mails here and quick-reading the patch (in my MUA, that's how quick the reading was), is that what you want here is what's done in the patch, which has been proposed as a WIP, too.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr
PostgreSQL : Expertise, Formation et Support
On Fri, Oct 8, 2010 at 5:59 PM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> It seems like it would be more helpful if you were working on
>> implementing a design that had more than one vote. As far as I can
>> tell, we have rough consensus that for the first commit we should only
>> worry about the case where k = 1; that is, only one ACK is ever
>> required for commit
>
> My understanding by reading the mails here and quick-reading the patch
> (in my MUA, that's how quick the reading was), is that what you want
> here is what's done in the patch, which has been proposed as a WIP, too.

It's not.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, 2010-10-08 at 12:23 -0400, Robert Haas wrote:

> It seems like it would be more helpful if you were working on
> implementing a design that had more than one vote. As far as I can
> tell, we have rough consensus that for the first commit we should only
> worry about the case where k = 1; that is, only one ACK is ever
> required for commit; and Greg Smith spelled out some more particulars
> for a minimum acceptable implementation in the second part of the
> email found here:
>
> http://archives.postgresql.org/pgsql-hackers/2010-10/msg00384.php

Robert,

I'm working on k = 1, as suggested by Josh Berkus, with whom many people agree. It is a simple default behaviour that will be easy to test.

Greg's proposal to implement other alternatives via a function is simply a restatement of what I had already proposed: we should have a plugin to provide alternate behaviours. We can add the plugin API later once we have a stable committed version. I am happy to do that, just as I originally proposed.

I don't believe it will be helpful to attempt to implement something more complex until we have the basic version.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
On Sat, Oct 9, 2010 at 3:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Fri, 2010-10-08 at 12:23 -0400, Robert Haas wrote:
>
>> It seems like it would be more helpful if you were working on
>> implementing a design that had more than one vote. As far as I can
>> tell, we have rough consensus that for the first commit we should only
>> worry about the case where k = 1; that is, only one ACK is ever
>> required for commit; and Greg Smith spelled out some more particulars
>> for a minimum acceptable implementation in the second part of the
>> email found here:
>>
>> http://archives.postgresql.org/pgsql-hackers/2010-10/msg00384.php
>
> Robert,
>
> I'm working on k = 1, as suggested by Josh Berkus and with whom many
> people agree. It is a simple default behaviour that will be easy to
> test.
>
> Greg's proposal to implement other alternatives via a function is simply
> a restatement of what I had already proposed: we should have a plugin to
> provide alternate behaviours. We can add the plugin API later once we
> have a stable committed version. I am happy to do that, just as I
> originally proposed.
>
> I don't believe it will be helpful to attempt to implement something
> more complex until we have the basic version.

I agree that we should start with a basic version, but it seems to me that you're ripping out things which are uncontroversial and leaving untouched things which are. To the best of my knowledge, there are no serious or widespread objections to allowing three synchronous replication levels: recv, fsync, apply. There are, however, a number of people, including me, who don't feel that whether or not the slave is synchronous should be configured on the slave. As Greg said:

    That would be a simple to configure setup where I list a subset of
    "important" nodes, and the appropriate acknowledgement level I want
    to hear from one of them. And when one of those nodes gives that
    acknowledgement, commit on the master happens too.

I am not going to put words in Greg's mouth, so I won't claim that when he speaks of listing a subset of important nodes, he actually means putting a list of them someplace, but that's what I and at least some other people want. In your design, AIUI, the list is implied by the settings on the slaves, not explicit.

I also think that if you're removing things for the first version, the timeout might be one to rip out. In between all the discussion of how and where synchronous replication ought to be configured, we have had some discussion of whether there should be a timeout, but almost no discussion of what the behavior of that timeout should be. Are we going to wait for that timeout on every commit? That seems almost certain to reduce a busy master to unusability. Is it a timeout before the slave is "declared dead" even though it's still connected, and if so how does the slave come back to life again? Is it a timeout before we forcibly disconnect the slave, and if so how is it better/worse/different than configuring TCP keepalives? I'm sure we can figure out good answers to all of those questions, but it might take a while to get consensus on any particular approach.

One other question that occurred to me this morning, not directly related to anything you're doing here. What exactly happens if the user types COMMIT, it hangs for a long time because it can't get an ACK from any other server, and the user gets tired of waiting and starts hitting ^C?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Oct 8, 2010 at 11:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> This patch is a rough WIP, mostly stripping out and streamlining. It
> doesn't work yet, but people say they like to see me working, so here
> 'tis.

It's been two months since you posted this. Any update?

I'd like to actually review the two patches on the table and form a more educated opinion about the differences between them and which may be better, but neither one applies at the moment.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company