Thread: write ahead logging in standby (streaming replication)
Hi, Should the standby also have to follow the WAL rule during recovery? The current patch doesn't care about the write order of the data page and WAL in the standby. So, after both servers fail, restarting the ex-standby by itself might corrupt the data. If the standby follows the WAL rule, walreceiver might be delayed in writing WAL records until the startup process's or bgwriter's fsync has finished. I'm a bit concerned that such a delay might increase the performance overhead on the primary. Thoughts? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao <masao.fujii@gmail.com> writes: > Should the standby also have to follow the WAL rule during recovery? > The current patch doesn't care about the write order of the data page > and WAL in the standby. So, after both servers fail, restarting the > ex-standby by itself might corrupt the data. Surely the receiver should fsync the WAL itself to disk before acknowledging it. Assuming you've done that, I don't see any corruption risk. regards, tom lane
On Thu, Nov 12, 2009 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Fujii Masao <masao.fujii@gmail.com> writes: >> Should the standby also have to follow the WAL rule during recovery? >> The current patch doesn't care about the write order of the data page >> and WAL in the standby. So, after both servers fail, restarting the >> ex-standby by itself might corrupt the data. > > Surely the receiver should fsync the WAL itself to disk before > acknowledging it. Assuming you've done that, I don't see any > corruption risk. "acknowledging it" means "letting the startup process know the arrival of WAL records"? If so, I agree that there is no risk of data corruption. The problem is that fsync needs to be issued too frequently, which would be harmless in asynchronous replication, but not in synchronous one. A transaction would have to wait for the primary's and standby's fsync before returning a "success" to a client. So I'm inclined to change the startup process and bgwriter, instead of walreceiver, so as to fsync the WAL for the WAL rule. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao wrote: > The problem is that fsync needs to be issued too frequently, which would > be harmless in asynchronous replication, but not in synchronous one. > A transaction would have to wait for the primary's and standby's fsync > before returning a "success" to a client. > > So I'm inclined to change the startup process and bgwriter, instead of > walreceiver, so as to fsync the WAL for the WAL rule. Let's keep it simple for now. Just make the walreceiver do the fsync. We can optimize later. For now, we're only going to have async mode anyway. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Hi, On Thu, Nov 12, 2009 at 4:32 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Fujii Masao wrote: >> The problem is that fsync needs to be issued too frequently, which would >> be harmless in asynchronous replication, but not in synchronous one. >> A transaction would have to wait for the primary's and standby's fsync >> before returning a "success" to a client. >> >> So I'm inclined to change the startup process and bgwriter, instead of >> walreceiver, so as to fsync the WAL for the WAL rule. > > Let's keep it simple for now. Just make the walreceiver do the fsync. We > can optimize later. For now, we're only going to have async mode anyway. Okay, I'll do that; the walreceiver issues an fsync each time WAL records arrive, and the startup process replays only the records already fsynced. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
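The simple scheme agreed on above (walreceiver fsyncs every arrival, startup replays only fsynced records) can be sketched roughly as follows. This is a toy model, not PostgreSQL code: the class `StandbyWal`, its fields `write_lsn` and `flush_lsn`, and the use of byte offsets as LSNs are all assumptions made for illustration.

```python
import os
import tempfile


class StandbyWal:
    """Toy model of a standby's WAL file (hypothetical, not PostgreSQL code)."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
        self.records = []    # (end_lsn, payload) for every record received
        self.write_lsn = 0   # end of data handed to the OS with write()
        self.flush_lsn = 0   # end of data known to be durable after fsync()

    def receive(self, payload: bytes) -> int:
        """walreceiver side: write the record, fsync it, then publish it."""
        os.write(self.fd, payload)
        self.write_lsn += len(payload)
        self.records.append((self.write_lsn, payload))
        os.fsync(self.fd)                # durable before anyone may replay it
        self.flush_lsn = self.write_lsn  # only now visible to the startup process
        return self.flush_lsn

    def replayable(self, applied_lsn: int):
        """startup side: records past applied_lsn that are already fsynced."""
        return [p for end, p in self.records
                if applied_lsn < end <= self.flush_lsn]

    def close(self):
        os.close(self.fd)
```

Because the startup process never sees a record beyond `flush_lsn`, no data page can depend on WAL that is not yet on disk; the cost is one fsync per arrival in the receiver.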
On Thu, 2009-11-12 at 17:03 +0900, Fujii Masao wrote: > On Thu, Nov 12, 2009 at 4:32 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: > > Fujii Masao wrote: > >> The problem is that fsync needs to be issued too frequently, which would > >> be harmless in asynchronous replication, but not in synchronous one. > >> A transaction would have to wait for the primary's and standby's fsync > >> before returning a "success" to a client. > >> > >> So I'm inclined to change the startup process and bgwriter, instead of > >> walreceiver, so as to fsync the WAL for the WAL rule. > > > > Let's keep it simple for now. Just make the walreceiver do the fsync. We > > can optimize later. For now, we're only going to have async mode anyway. > > Okey, I'll do that; the walreceiver issues the fsync for each arrival of > the WAL records, and the startup process replays only the records already > fsynced. I agree with you, though it has taken some time to understand what you said and at first my reaction was to disagree. I think the responses you got on this are because you dived straight in with a question before explaining other things around this. We already have a number of options for how to handle incoming WAL. We can choose to fsync or not when WAL arrives. Choosing *not* to fsync would be the typical choice because it provides reasonable performance; fsyncing after each transaction commit would be worse. In any case, if WAL receiver does the fsyncs then we will get worse performance. If we reduce the number of fsyncs it does we just get spiky behaviour around the fsyncs. If recovery starts reading WAL records that have not been fsynced then we may need to flush a shared buffer to disk that depends upon a non-fsynced(yet) WAL record. Fsyncing WAL after *every* WAL record is going to make performance suck even worse and is completely out of the question. So implementing the fsync-WAL-before-buffer-flush rule during recovery makes much more sense. 
It's also only a small change in XlogFlush(). Another way of doing this would be to only allow recovery to progress as far as has been fsynced. That seems a more plausible approach, but would lead to delays if we had a small number of long write transactions. The benefit of streaming is that it potentially allows us to keep as near to real-time recovery as possible. So overall, yes, we need to do as you suggested: implement the WAL rule in recovery. The WAL receiver smoothly does write(), Startup replays and we leave the WAL file fsyncs to be performed by the bgwriter. But I also agree with Heikki. Let's plan to do this later in this release. -- Simon Riggs www.2ndQuadrant.com
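The fsync-WAL-before-buffer-flush rule described above might look something like this in outline. It is only a sketch under stated assumptions: `RecoveryWal`, `flush_up_to` and `flush_buffer` are hypothetical names, the fsync is simulated by a counter, and the real change would live inside the server's WAL flush routine rather than in a separate class.

```python
class RecoveryWal:
    """Toy WAL state on the standby: written vs. durable positions."""

    def __init__(self):
        self.write_lsn = 0    # advanced by the receiver's plain write()
        self.flush_lsn = 0    # advanced only when an fsync happens
        self.fsync_calls = 0  # counts simulated fsyncs

    def write(self, nbytes: int) -> int:
        """Receiver side: write() without fsync; returns the record's end LSN."""
        self.write_lsn += nbytes
        return self.write_lsn

    def flush_up_to(self, lsn: int) -> None:
        """Make WAL durable up to lsn, skipping the fsync if it already is."""
        if lsn <= self.flush_lsn:
            return
        if lsn > self.write_lsn:
            raise ValueError("cannot flush WAL that has not been written")
        self.fsync_calls += 1            # stands in for fsync() on the WAL file
        self.flush_lsn = self.write_lsn  # one fsync covers everything written


def flush_buffer(wal: RecoveryWal, page_lsn: int, write_page) -> None:
    """WAL rule: the WAL describing a page must be durable before the page."""
    wal.flush_up_to(page_lsn)
    write_page()
```

In this model the receiver never waits for the disk, and an fsync is paid only when a dirty page that depends on unflushed WAL has to be written out.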
On Thu, Nov 12, 2009 at 6:27 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > I agree with you, though it has taken some time to understand what you > said and at first my reaction was to disagree. I think the responses you > got on this are because you dived straight in with a question before > explaining other things around this. Thanks for clarifying this topic ;) > If recovery starts reading WAL records that have not been fsynced then > we may need to flush a shared buffer to disk that depends upon a > non-fsynced(yet) WAL record. Fsyncing WAL after *every* WAL record is > going to make performance suck even worse and is completely out of the > question. So implementing the fsync-WAL-before-buffer-flush rule during > recovery makes much more sense. It's also only small change during > XlogFlush(). Agreed. This approach has less impact on performance. But, as I said in my first post on this thread, even such infrequent fsync-WAL-before-buffer-flush might cause a response time spike on the primary because the walreceiver must sleep during that fsync. I think that leaving the WAL-logging business to another process like walwriter is a good idea for further reducing the impact on the walreceiver. In the typical case:
* The walreceiver receives WAL records, returns the ACK to the primary, saves them in the wal_buffers, and lets the startup process know of their arrival.
* The walwriter writes and fsyncs the WAL records in the wal_buffers.
* The startup process applies the WAL records in the wal_buffers when it receives the notice of the arrival.
* The startup process and bgwriter fsync the WAL before the buffer flush.
Of course, since this approach is too complicated, it's out of the scope of the development for v8.5. > But I also agree with Heikki. Let's plan to do this later in this > release. Okay. I won't implement anything around this topic until the core part of asynchronous replication has been committed. 
Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, 2009-11-12 at 21:45 +0900, Fujii Masao wrote: > But, as I said on my first post on this thread, even such low-frequent > fsync-WAL-before-buffer-flush might cause a response time spike on the > primary because the walreceiver must sleep during that fsync. I think > that leaving the WAL-logging business to another process like walwriter > is a good idea for reducing further the impact on the walreceiver; In > typical case, Agree completely. > Of course, since this approach is too complicated, it's out of the scope > of the development for v8.5. It's out of scope for phase 1, certainly. -- Simon Riggs www.2ndQuadrant.com
Fujii Masao <masao.fujii@gmail.com> writes: > The problem is that fsync needs to be issued too frequently, which would > be harmless in asynchronous replication, but not in synchronous one. > A transaction would have to wait for the primary's and standby's fsync > before returning a "success" to a client. Surely that is exactly what is *required* if the user has asked for synchronous replication. regards, tom lane
Tom Lane wrote: > Fujii Masao <masao.fujii@gmail.com> writes: >> The problem is that fsync needs to be issued too frequently, which would >> be harmless in asynchronous replication, but not in synchronous one. >> A transaction would have to wait for the primary's and standby's fsync >> before returning a "success" to a client. > > Surely that is exactly what is *required* if the user has asked for > synchronous replication. This is a distressingly common thing people get wrong about replication. You can either have synchronous replication, which as you say has to be slow: you must wait for an fsync ACK from the secondary and a return trip before you can say something is committed on the primary. Or you can get better performance by not waiting for all of those things, but the minute you do that it's *not* synchronous replication anymore. You can't get high-performance and true synchronous behavior; you have to pick one. The best you can do if you need both is work on accelerating fsync everywhere using the standard battery-backed write cache technique. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
On Fri, Nov 13, 2009 at 1:49 AM, Greg Smith <greg@2ndquadrant.com> wrote: > This a distressingly common thing people get wrong about replication. You > can either have synchronous replication, which as you say has to be slow: > you must wait for an fsync ACK from the secondary and a return trip before > you can say something is committed on the primary. Or you can get better > performance by not waiting for all of those things, but the minute you do > that it's *not* synchronous replication anymore. You can't get > high-performance and true synchronous behavior; you have to pick one. The > best you can do if you need both is work on accelerating fsync everywhere > using the standard battery-backed write cache technique. I'm not happy that such frequent fsyncs would harm even semi-synchronous replication, not just the synchronous kind. (By semi-synchronous I mean that you must wait for a *recv* ACK from the secondary and a return trip before you can say something is committed on the primary. This corresponds to DRBD's protocol B.) Personally, I think that semi-synchronous replication is sufficient for HA. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
* Fujii Masao <masao.fujii@gmail.com> [091112 20:52]: > Personally, I think that > semi-synchronous replication is sufficient for HA. Often, but that's not synchronous replication so don't call it such... -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Fujii Masao wrote: > Personally, I think that semi-synchronous replication is sufficient for HA. > Whether or not you think it's sufficient for what you have in mind, "synchronous replication" requires a return ACK from the secondary before you say things are committed on the primary. If you don't do that, it's not true sync replication anymore; it's asynchronous replication. Plenty of people decide that a local commit combined with a promise to synchronize as soon as possible to the slave is good enough for their apps, which as you say is getting referred to as "semi-synchronous replication" nowadays. That's an awful name though, because it's not true--that's asynchronous replication, just aiming for minimal lag. It's OK to say that's what you want, but you can't say it's really a synchronous commit anymore if you do things that way. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
On Fri, Nov 13, 2009 at 10:58 AM, Aidan Van Dyk <aidan@highrise.ca> wrote: > * Fujii Masao <masao.fujii@gmail.com> [091112 20:52]: > >> Personally, I think that >> semi-synchronous replication is sufficient for HA. > > Often, but that's not synchronous replication so don't call it such... Hmm, though I'm not sure about your definition of "synchronous", if the primary waits for a *redo* ACK from the standby before returning a "success" of a transaction to a client, can you call SR synchronous? This is one of the TODO items of SR for v8.5. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, Nov 13, 2009 at 11:15 AM, Greg Smith <greg@2ndquadrant.com> wrote: > Whether or not you think it's sufficient for what you have in mind, > "synchronous replication" requires a return ACK from the secondary before > you say things are committed on the primary. If you don't do that, it's not > true sync replication anymore; it's asynchronous replication. Plenty of > people decide that a local commit combined with a promise to synchronize as > soon as possible to the slave is good enough for their apps, which as you > say is getting referred to as "semi-synchronous replication" nowadays. > That's an awful name though, because it's not true--that's asynchronous > replication, just aiming for minimal lag. It's OK to say that's what you > want, but you can't say it's really a synchronous commit anymore if you do > things that way. Umm... what is your definition of "synchronous"? I'm planning to provide four synchronization modes as follows, for v8.5. Does this fit your thinking? The primary waits for ... before returning "success" of a transaction:
* nothing - asynchronous replication
* recv ACK - semi-synchronous replication
* fsync ACK - semi-synchronous replication
* redo ACK - synchronous replication
Or, in synchronous replication, must we wait for both an fsync and a redo ACK? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
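The four proposed modes can be read as four different conditions under which the primary may report success. A minimal sketch, assuming the standby reports three progress positions back to the primary; the names `SyncMode`, `StandbyProgress` and `commit_may_return` are invented here and do not come from the patch. Whether "redo ACK" should also imply an fsync on the standby is left open, as it is in the message above.

```python
from dataclasses import dataclass
from enum import Enum


class SyncMode(Enum):
    ASYNC = "nothing"    # asynchronous replication
    RECV = "recv ACK"    # standby has received the WAL
    FSYNC = "fsync ACK"  # standby has fsynced the WAL
    REDO = "redo ACK"    # standby has replayed the WAL


@dataclass
class StandbyProgress:
    """Positions last acknowledged by the standby (hypothetical fields)."""
    recv_lsn: int = 0
    fsync_lsn: int = 0
    redo_lsn: int = 0


def commit_may_return(mode: SyncMode, commit_lsn: int,
                      standby: StandbyProgress) -> bool:
    """True once the primary may report success for a commit at commit_lsn."""
    if mode is SyncMode.ASYNC:
        return True
    reached = {
        SyncMode.RECV: standby.recv_lsn,
        SyncMode.FSYNC: standby.fsync_lsn,
        SyncMode.REDO: standby.redo_lsn,
    }[mode]
    return reached >= commit_lsn
```

For example, with a standby that has received up to 100, fsynced up to 80 and replayed up to 50, a commit at position 90 may return under the "nothing" and "recv ACK" modes but must keep waiting under the other two.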
On Fri, Nov 13, 2009 at 2:37 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > Umm... what is your definition of "synchronous"? I'm planning to provide > four synchronization modes as follows, for v8.5. Does this fit in your I think my definition would be that a query against the replica will produce the same result as a query against the master -- and that that will be the case even after a system failure. That might not necessarily mean that the log entry is fsynced on the replica, only that it's fsynced in a location where the replica will have access to it when it runs recovery. I do have a different question though. What do you plan to do if there's a failure when they're out of sync? The master hasn't responded to the commit yet because it's still waiting on the replica to respond but it has already recorded the commit itself. When it comes back up it's out of sync with the replica and has to resend those records? What if the replica has already received it and it was the confirmation which was lost? -- greg
Fujii Masao wrote: > Umm... what is your definition of "synchronous"? I'm planning to provide > four synchronization modes as follows, for v8.5. Does this fit in your > thought? > > The primary waits ... before returning "success" of a transaction; > * nothing - asynchronous replication > * recv ACK - semi-synchronous replication > * fsync ACK - semi-synchronous replication > * redo ACK - synchronous replication > > Or, in synchronous replication, we must wait a fsync and a redo ACK? > Right, those are the possibilities, all four of them have valid use cases in the field and are worth implementing. I don't like the label "semi-synchronous replication" myself, but it's a valuable feature to implement, and that is unfortunately the term other parts of the industry use for that approach. But everyone needs to be extremely careful with the terminology here: if you say "synchronous replication", that *only* means what you're labeling "redo ACK" ("WAL ACK" really). "Synchronous replication" should not be used as a group term that includes the semi-synchronous variations, which are in fact asynchronous despite their marketing name. If someone means semi-synchronous, but they say synchronous thinking it's a shared term also applicable to the semi-synchronous variations here, that's just going to be confusing for everyone. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
On Fri, Nov 13, 2009 at 11:54 AM, Greg Stark <gsstark@mit.edu> wrote: > I think my definition would be that a query against the replica will > produce the same result as a query against the master -- and that that > will be the case even after a system failure. That might not > necessarily mean that the log entry is fsynced on the replica, only > that it's fsynced in a location where the replica will have access to > it when it runs recovery. Agreed. > I do have a different question though. What do you plan to do if > there's a failure when they're out of sync? The master hasn't > responded to the commit yet because it's still waiting on the replica > to respond but it has already recorded the commit itself. When it > comes back up it's out of sync with the replica and has to resend > those records? What if the replica has already received it and it was > the confirmation which was lost? If the connection is not closed, resending is not required because TCP would guarantee that such records eventually arrive at the standby. Otherwise, the standby reconnects to the primary and asks for the missing records, so the resending would be done. Since only the missing records are requested, the already received records don't reach the standby again, I think. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
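The reconnect behaviour described above, where the standby asks only for what it is missing, could be modelled like this. It is a sketch with invented names (`Primary.stream_from`, `Standby.reconnect`), and it assumes record i ends at position i + 1 so that list indexes can stand in for WAL positions.

```python
class Primary:
    """Toy primary holding an ordered list of WAL records."""

    def __init__(self, records):
        self.records = list(records)  # record i ends at position i + 1

    def stream_from(self, start_lsn: int):
        """Return every record after start_lsn, in order."""
        return self.records[start_lsn:]


class Standby:
    """Toy standby that remembers what it has already received."""

    def __init__(self):
        self.received = []

    def last_lsn(self) -> int:
        return len(self.received)

    def reconnect(self, primary: Primary) -> int:
        """Ask only for the missing records; return how many were resent."""
        missing = primary.stream_from(self.last_lsn())
        self.received.extend(missing)
        return len(missing)
```

In this model a lost confirmation is harmless: the standby's own position, not the primary's record of ACKs, decides where streaming resumes, so records it already holds are never delivered twice.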
On Fri, Nov 13, 2009 at 1:49 PM, Greg Smith <greg@2ndquadrant.com> wrote: > Right, those are the possibilities, all four of them have valid use cases in > the field and are worth implementing. I don't like the label > "semi-synchronous replication" myself, but it's a valuable feature to > implement, and that is unfortunately the term other parts of the industry > use for that approach. BTW, MySQL and DRBD use the term "semi-synchronous": http://forge.mysql.com/wiki/ReplicationFeatures/SemiSyncReplication http://www.drbd.org/users-guide/s-replication-protocols.html Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Hi Greg and Fujii, Just a point on terminology: there's a difference in the usage of semi-synchronous between DRBD and MySQL semi-synchronous replication, which was originally developed by Google. In the Google case semi-synchronous replication is a quorum algorithm where clients receive a commit notification only after at least one of N slaves has received the replication event. In the DRBD case semi-synchronous means that events have reached the slave but are not necessarily durable. There's no quorum. Of these two usages the Google semi-sync approach is the more interesting because it avoids the availability problems associated with fully synchronous operation but gets most of the durability benefits. Cheers, Robert On 11/12/09 9:29 PM PST, "Fujii Masao" <masao.fujii@gmail.com> wrote: > On Fri, Nov 13, 2009 at 1:49 PM, Greg Smith <greg@2ndquadrant.com> wrote: >> Right, those are the possibilities, all four of them have valid use cases in >> the field and are worth implementing. I don't like the label >> "semi-synchronous replication" myself, but it's a valuable feature to >> implement, and that is unfortunately the term other parts of the industry >> use for that approach. > > BTW, MySQL and DRBD use the term "semi-synchronous": > http://forge.mysql.com/wiki/ReplicationFeatures/SemiSyncReplication > http://www.drbd.org/users-guide/s-replication-protocols.html > > Regards, > > -- > Fujii Masao > NIPPON TELEGRAPH AND TELEPHONE CORPORATION > NTT Open Source Software Center
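The quorum flavour of semi-synchronous replication described above reduces to a simple counting rule. A minimal sketch, with a hypothetical function name and a made-up mapping of slave names to receipt flags; it is not taken from the MySQL or Google code.

```python
def commit_acknowledged(acks: dict, quorum: int = 1) -> bool:
    """True once at least `quorum` slaves report having received the event.

    `acks` maps a slave name to True if that slave has confirmed receipt.
    With quorum=1 this is the "at least one of N slaves" rule.
    """
    return sum(1 for received in acks.values() if received) >= quorum
```

With three slaves of which one has confirmed, `commit_acknowledged({"s1": False, "s2": True, "s3": False})` is true, so the client can be notified even though two slaves are lagging or down. That is the availability advantage over waiting for every standby.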
Fujii Masao wrote: > On Fri, Nov 13, 2009 at 1:49 PM, Greg Smith <greg@2ndquadrant.com> wrote: > >> Right, those are the possibilities, all four of them have valid use cases in >> the field and are worth implementing. I don't like the label >> "semi-synchronous replication" myself, but it's a valuable feature to >> implement, and that is unfortunately the term other parts of the industry >> use for that approach. >> > > BTW, MySQL and DRBD use the term "semi-synchronous": > http://forge.mysql.com/wiki/ReplicationFeatures/SemiSyncReplication > http://www.drbd.org/users-guide/s-replication-protocols.html > Yeah, that's the "other parts of the industry" I was referring to. MySQL uses "semi-synchronous" to distinguish between its completely asynchronous default replication mode and one where it provides a somewhat safer implementation. The description reads more as "asynchronous with some synchronous elements", not "one style of synchronous implementation". None of their documentation wanders into the problem area here by calling it a true synchronous solution when it's really not--MySQL Cluster is their synchronous vehicle. It's fine to adopt the term "semi-synchronous", as it's become quite popular and people are going to label the PG implementation with it regardless of what is settled on here. But we should all try to be careful to use it as correctly as possible. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
On Fri, Nov 13, 2009 at 3:17 PM, Greg Smith <greg@2ndquadrant.com> wrote: > Yeah, that's the "other parts of the industry" I was referring to. MySQL > uses "semi-synchronous" to distinguish between its completely asynchronous > default replication mode and one where it provides a somewhat safer > implementation. The description reads more as "asynchronous with some > synchronous elements", not "one style of synchronous implementation". None > of their documentation wanders into the problem area here by calling it a > true synchronous solution when it's really not--MySQL Cluster is their > synchronous vehicle. > It's fine to adopt the term "semi-synchronous", as it's become quite popular > and people are going to label the PG implementation with it regardless of > what is settled on here. But we should all try to be careful to use it as > correctly as possible. OK. Let's think over what "recv ACK" and "fsync ACK" synchronization modes should be called later. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Hi, Greg Stark wrote: > I think my definition would be that a query against the replica will > produce the same result as a query against the master -- and that that > will be the case even after a system failure. That might not > necessarily mean that the log entry is fsynced on the replica, only > that it's fsynced in a location where the replica will have access to > it when it runs recovery. I tend to agree with that definition of synchrony for replicated databases. However, let me point to an earlier thread around the same topic: http://archives.postgresql.org/message-id/4942ECF7.5040601@bluegap.ch You will definitely find different definitions and requirements of what synchronous replication means there. It convinced me that "synchronous" is more of a marketing term in this area and is better avoided in technical documents and discussions, or needs explanation. As far as marketing goes, there are the customers who absolutely want synchronous replication for its consistency and then there are the others who absolutely don't want it due to its unusably high latency. Regards Markus Wanner
Markus Wanner wrote: > You will definitely find different definitions and requirements of what > synchronous replication means there. To quote from the Wikipedia entry on "Database Replication" that Simon pointed to during the earlier discussion, http://en.wikipedia.org/wiki/Database_replication "Synchronous replication - guarantees "zero data loss" by the means of atomic write operation, i.e. write either completes on both sides or not at all. Write is not considered complete until acknowledgement by both local and remote storage." That last part is the critical one: "acknowledgement by both local and remote storage" is required before you can label something truly synchronous replication. In implementation terms, that means you must have both local and slave fsync calls finish to be considered truly synchronous. That part is not ambiguous at all. There's a definition of the weaker form in there too, which is where the ambiguity is at: "Semi-synchronous replication - this usually means that a write is considered complete as soon as local storage acknowledges it and a remote server acknowledges that it has received the write either into memory or to a dedicated log file." I don't consider that really synchronous replication anymore, but as you say it's been strengthened by marketing enough to be a valid industry term at this point. Since it's already gained traction we might use it, as long as it's defined properly and its trade-offs vs. a true synchronous implementation are documented. -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
Hi, Quoting "Greg Smith" <greg@2ndquadrant.com>: > "Synchronous replication - guarantees "zero data loss" by the means > of atomic write operation, i.e. write either completes on both sides > or not at all. Write is not considered complete until > acknowledgement by both local and remote storage." Note that a storage acknowledgement (hopefully) guarantees durability, but it does not necessarily mean that the transactional changes are immediately visible on a remote node. Which is what you had in your definition. My point is that there are at least three things that can run synchronously or not, WRT distributed databases:
1. conflict detection and handling (for consistency)
2. storage acknowledgement (for durability)
3. effective application of changes (for visibility across nodes)
> That last part is the critical one: "acknowledgement by both local > and remote storage" is required before you can label something truly > synchronous replication. In implementation terms, that means you > must have both local and slave fsync calls finish to be considered > truly synchronous. That part is not ambiguous at all. I personally agree 100%. (Given it implies a congruent conflict handling *before* the disk write. Having conflicting transactional changes on the disk wouldn't help much at recovery time). (And yes, this means I think the effective application of changes can be deferred. IMO the load balancer and/or the application should take care not to send transactions from the same session to different nodes). > "Semi-synchronous replication ..is plain nonsense to my ears. Either something is synchronous or it is not. No half, no semi, no virtual synchrony. To have any technical relevance, one needs to add *what* is synchronous and what not. In that spirit I have to admit that the term 'eager' that I'm currently using to describe Postgres-R may not be any more helpful. I take it to mean synchrony of 1. and 2., but not 3. Regards Markus Wanner