Thread: Inconsistent DB data in Streaming Replication
<div dir="ltr"><p class="">Hello,<p class="">We have been trying to figure out possible solutions to the following problemin streaming replication Consider following scenario:<p class="">If master receives commit command, it writes andflushes commit WAL records to the disk, It also writes and flushes data page related to this transaction.<p class="">Themaster then sends WAL records to standby up to the commit WAL record. But before sending these records if failoverhappens then, old master is ahead of standby which is now the new master in terms of DB data leading to inconsistentdata . <p class=""> <p class="">One solution to avoid this situation is have the master send WAL records to standbyand wait for ACK from standby committing WAL files to disk and only after that commit data page related to this transactionon master.<p class="">The main drawback would be increased wait time for the client due to extra round trip tostandby before master sends ACK to client. Are there any other issues with this approach?<p class=""><br /><p class="">Thankyou,<p class=""><p class="">Samrat</div>
On 04/08/2013 05:34 AM, Samrat Revagade wrote:

> One solution to avoid this situation is to have the master send WAL
> records to the standby and wait for an ACK from the standby committing
> the WAL files to disk, and only after that commit the data page
> related to this transaction on the master.

Isn't this basically what synchronous replication does in PG 9.1+?

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com
Samrat Revagade <revagade.samrat@gmail.com> writes:
> We have been trying to figure out possible solutions to the following
> problem in streaming replication. Consider the following scenario:

> When the master receives a commit command, it writes and flushes the
> commit WAL records to disk. It also writes and flushes the data page
> related to this transaction.

> The master then sends WAL records to the standby, up to the commit WAL
> record. But if failover happens before these records are sent, the old
> master is ahead of the standby (which is now the new master) in terms
> of DB data, leading to inconsistent data.

I don't exactly see the problem ... unless you're imagining that master
and slave share the same data storage or something like that. That's
not going to work for a ton of reasons besides this one.

			regards, tom lane
On Mon, Apr 8, 2013 at 6:50 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> On 04/08/2013 05:34 AM, Samrat Revagade wrote:
>
>> One solution to avoid this situation is to have the master send WAL
>> records to the standby and wait for an ACK from the standby
>> committing the WAL files to disk, and only after that commit the data
>> page related to this transaction on the master.
>
> Isn't this basically what synchronous replication does in PG 9.1+?

Not exactly. Sync-rep ensures that commit success is not sent to the
client before a synchronous replica acks the commit record. What Samrat
is proposing here is that WAL is not flushed to the OS before it is
acked by a synchronous replica, so recovery won't go past the timeline
change made in failover (going past it is what makes a new base backup
necessary to resync with the new master). I seem to remember this being
discussed when sync rep was committed. I don't recall if the idea was
discarded only on performance grounds or whether there were other
issues too.

Thinking about it now, the requirement is that after a crash and
failover to a sync replica we should be able to reuse the datadir to
replicate from the new master without consistency problems. We should
be able to achieve that by ensuring that we don't write out pages until
we have received an ack from the sync replica, and that we check for
possible timeline switches before recovering local WAL. For the first,
it seems to me that it should be enough to rework the updating of
XLogCtl->LogwrtResult.Flush so it accounts for the sync replica. For
the second part, I think Heikki's work on enabling timeline switches
over streaming connections already ensures this (I haven't checked it
out in detail), but if not, it shouldn't be too hard to add.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
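To make the XLogCtl->LogwrtResult.Flush rework concrete, here is a
minimal sketch in illustrative C. The standalone typedef and the helper
name are hypothetical, not actual xlog.c code:

    #include <stdint.h>

    typedef uint64_t XLogRecPtr;   /* a WAL position, as in 9.3-era PostgreSQL */

    /*
     * Hypothetical helper: treat WAL as "flushed" only up to the point
     * that is both fsynced locally and acknowledged by the synchronous
     * replica.  Data pages whose LSN lies beyond this point must not be
     * written out yet.
     */
    static XLogRecPtr
    effective_flush_ptr(XLogRecPtr local_flush, XLogRecPtr replica_ack)
    {
        return (replica_ack < local_flush) ? replica_ack : local_flush;
    }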
On 2013-04-08 19:26:33 +0300, Ants Aasma wrote:
> On Mon, Apr 8, 2013 at 6:50 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:
>> Isn't this basically what synchronous replication does in PG 9.1+?
>
> Not exactly. Sync-rep ensures that commit success is not sent to the
> client before a synchronous replica acks the commit record. What
> Samrat is proposing here is that WAL is not flushed to the OS before
> it is acked by a synchronous replica, so recovery won't go past the
> timeline change made in failover (going past it is what makes a new
> base backup necessary to resync with the new master). I seem to
> remember this being discussed when sync rep was committed. I don't
> recall if the idea was discarded only on performance grounds or
> whether there were other issues too.

That's not going to work for a fair number of reasons:

* WAL is streamed *from disk*, not from memory
* what if the local node crashes/restarts immediately? Then the standby
  is farther ahead than the master.
* the performance implications of never writing data before flushing it
  are pretty severe
* ...

So this doesn't seem to solve anything.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Apr 8, 2013 at 7:38 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-04-08 19:26:33 +0300, Ants Aasma wrote:
>> Not exactly. Sync-rep ensures that commit success is not sent to the
>> client before a synchronous replica acks the commit record. What
>> Samrat is proposing here is that WAL is not flushed to the OS before
>> it is acked by a synchronous replica, so recovery won't go past the
>> timeline change made in failover (going past it is what makes a new
>> base backup necessary to resync with the new master).
>
> That's not going to work for a fair number of reasons:
>
> * WAL is streamed *from disk*, not from memory

Yeah, this one alone makes the do-not-flush-before-replicating approach
impractical.

> * what if the local node crashes/restarts immediately? Then the
>   standby is farther ahead than the master.
> * the performance implications of never writing data before flushing
>   it are pretty severe
> * ...
>
> So this doesn't seem to solve anything.

Yeah, delaying WAL writes until replication is successful seems
impractical, but I don't see why we couldn't optionally take into
account walsender write pointers when considering whether we can write
out a page. Sure, there will be some performance hit for waiting to
replicate WAL, but on the other hand having to rsync a huge database
isn't too good for performance either.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
On Mon, Apr 8, 2013 at 7:34 PM, Samrat Revagade
<revagade.samrat@gmail.com> wrote:
>
> Hello,
>
> We have been trying to figure out possible solutions to the following
> problem in streaming replication. Consider the following scenario:
>
> When the master receives a commit command, it writes and flushes the
> commit WAL records to disk. It also writes and flushes the data page
> related to this transaction.
>
> The master then sends WAL records to the standby, up to the commit WAL
> record. But if failover happens before these records are sent, the old
> master is ahead of the standby (which is now the new master) in terms
> of DB data, leading to inconsistent data.

Why do you think that the inconsistent data after failover is a problem?
Because it's one of the reasons why a fresh base backup is required when
starting the old master as a new standby? If yes, I agree with you. I've
often heard complaints about needing a backup when restarting the new
standby. That's a really big problem.

The timeline mismatch after failover was one of the reasons why a backup
is required. But, thanks to Heikki's recent work, that's solved, i.e.,
the timeline mismatch will be automatically resolved when starting
replication in 9.3. So the remaining problem is an inconsistent
database.

> One solution to avoid this situation is to have the master send WAL
> records to the standby and wait for an ACK from the standby committing
> the WAL files to disk, and only after that commit the data page
> related to this transaction on the master.

You mean to make the master wait to write the data page until the WAL
has been not only flushed to disk but also replicated to the standby?

> The main drawback would be increased wait time for the client due to
> the extra round trip to the standby before the master sends the ACK to
> the client. Are there any other issues with this approach?

I think that you can introduce a GUC specifying whether this extra check
is required, to avoid a backup when failing back.

Regards,

--
Fujii Masao
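For illustration only, such a knob might look like this in
postgresql.conf. The option name is invented here; no such GUC exists:

    # Hypothetical GUC sketched in this thread, not an existing option:
    # wait for the standby to ACK the WAL before writing data pages
    # covered by it, so a failed-over master can be resynced without a
    # fresh base backup.
    synchronous_standby_data_write = on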
<div dir="ltr"><div style="border-width:1.5pt 1pt;border-color:rgb(211,211,211);border-style:solid;padding:12pt 0in 12pt24pt"><p class="" style="margin-bottom:0.0001pt;border:none;padding:0in"><p class="" style="margin-bottom:0.0001pt"><spanstyle="font-size:10pt;font-family:Arial,sans-serif">>What Samrat is proposing hereis that WAL is not flushed to the OS before</span><p class="" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:Arial,sans-serif">>itis acked by a synchronous replica so recovery won't go past the</span><pclass="" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:Arial,sans-serif">>timelinechange made in failover, making it necessary to take a new</span><pclass="" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:Arial,sans-serif">>base backupto resync with the new master.</span><p><p class="" style="margin-bottom:0.0001pt;border:none;padding:0in"><p class=""style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:Arial,sans-serif">Actually we are proposingthat the data page on the master is not committed till master receives ACK from the standby. The WAL files can beflushed to the disk on both the master and standby, before standby generates ACK to master. The end objective is the sameof avoiding to take base backup of old master to resync with new master.</span><p class="" style="margin-bottom:0.0001pt"><pclass="" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:Arial,sans-serif">>Whydo you think that the inconsistent data after failover happensis</span><span style="font-size:10pt;font-family:Arial,sans-serif"><br /><span style="background-repeat:initial initial">>problem?Because </span></span><p class="" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:Arial,sans-serif">>it'sone of the reasons why a fresh base backup is required when</span><spanstyle="font-size:10pt;font-family:Arial,sans-serif"><br /><span style="background-repeat:initial initial">>startingold master as</span><br /><span style="background-repeat:initial initial">>new standby? If yes, Iagree with you. I've often heard the complaints</span><br /><span style="background-repeat:initial initial">>about abackup</span><br /><span style="background-repeat:initial initial">>when restarting new standby. That's really big problem.</span></span><pclass="" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:Arial,sans-serif"> </span><span style="font-family:Arial,sans-serif;font-size:10pt">Yes,taking backup is</span><span style="font-family:Arial,sans-serif;font-size:10pt"> </span><span style="font-family:Arial,sans-serif;font-size:10pt">majorproblem when the database size is more than several TB. It wouldtake very long time to ship backup data over the slow WAN network. 
</span><p class="" style="margin-bottom:12pt"><spanstyle="font-size:10pt;font-family:Arial,sans-serif;color:rgb(80,0,80)">>> One solutionto avoid this situation is have the master send WAL records to standby and wait for ACK from standby committing WALfiles to disk and only after that commit data page related to this transaction on master.</span><p class="" style="margin-bottom:0.0001pt"><spanstyle="font-size:10pt;font-family:Arial,sans-serif">>You mean to make the master waitthe data page write until WAL has been not only</span><span style="font-size:10pt;font-family:Arial,sans-serif"><br /><spanstyle="background-repeat:initial initial">>flushed to disk but also replicated to the standby?</span></span><pclass="" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:Arial,sans-serif"> </span><span style="font-size:10pt;font-family:Arial,sans-serif">Yes.Master should not write the data page before corresponding WAL recordshave been replicated to the standby. The WAL records have been flushed to disk on both master and standby.</span><pclass="" style="margin-bottom:12pt"><span style="font-size:10pt;font-family:Arial,sans-serif;color:rgb(80,0,80)">>>The main drawback would be increased waittime for the client due to extra round trip to standby before master sends ACK to client. Are there any other issueswith this approach?</span><p class=""><span style="font-size:10pt;line-height:115%;font-family:Arial,sans-serif">>Ithink that you can introduce GUC specifying whetherthis extra check</span><span style="font-size:10pt;line-height:115%;font-family:Arial,sans-serif"><br /><span style="background-repeat:initialinitial">>is required to avoid a backup when failback</span></span><p class=""><p class=""style="margin-bottom:0.0001pt"><span style="font-family:Arial,sans-serif;font-size:10pt;line-height:115%">That wouldbe better idea. We can disable it whenever taking a fresh backup is not a problem. 
</span><span style="font-family:Arial,sans-serif;font-size:10pt;line-height:115%"> </span><span style="font-family:Arial,sans-serif;font-size:10pt;line-height:115%"> </span><span style="font-family:Arial,sans-serif;font-size:10pt;line-height:115%"> </span><span style="font-family:Arial,sans-serif;font-size:10pt;line-height:115%"> </span><span style="font-size:10pt;font-family:Arial,sans-serif"> </span><pclass="" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:Arial,sans-serif"><br/></span><p class="" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:Arial,sans-serif">Regards,</span><pclass="" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:Arial,sans-serif">Samrat </span><span style="font-size:10pt;font-family:Arial,sans-serif"> </span><p><p></div></div><divclass="gmail_extra"><br /><br /><div class="gmail_quote">OnMon, Apr 8, 2013 at 10:40 PM, Fujii Masao <span dir="ltr"><<a href="mailto:masao.fujii@gmail.com"target="_blank">masao.fujii@gmail.com</a>></span> wrote:<br /><blockquote class="gmail_quote"style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">On Mon, Apr 8, 2013at 7:34 PM, Samrat Revagade<br /> <<a href="mailto:revagade.samrat@gmail.com">revagade.samrat@gmail.com</a>> wrote:<br/> ><br /> > Hello,<br /> ><br /> > We have been trying to figure out possible solutions to the followingproblem in streaming replication Consider following scenario:<br /> ><br /> > If master receives commit command,it writes and flushes commit WAL records to the disk, It also writes and flushes data page related to this transaction.<br/> ><br /> > The master then sends WAL records to standby up to the commit WAL record. But before sendingthese records if failover happens then, old master is ahead of standby which is now the new master in terms of DBdata leading to inconsistent data .<br /><br /></div>Why do you think that the inconsistent data after failover happensis<br /> problem? Because<br /> it's one of the reasons why a fresh base backup is required when<br /> starting oldmaster as<br /> new standby? If yes, I agree with you. I've often heard the complaints<br /> about a backup<br /> whenrestarting new standby. That's really big problem.<br /><br /> The timeline mismatch after failover was one of the reasonswhy a<br /> backup is required.<br /> But, thanks to Heikki's recent work, that's solved, i.e., the timeline<br />mismatch would be<br /> automatically resolved when starting replication in 9.3. So, the<br /> remaining problem is an<br/> inconsistent database.<br /><div class="im"><br /> > One solution to avoid this situation is have the master sendWAL records to standby and wait for ACK from standby committing WAL files to disk and only after that commit data pagerelated to this transaction on master.<br /><br /></div>You mean to make the master wait the data page write until WALhas been not only<br /> flushed to disk but also replicated to the standby?<br /><div class="im"><br /> > The maindrawback would be increased wait time for the client due to extra round trip to standby before master sends ACK to client.Are there any other issues with this approach?<br /><br /></div>I think that you can introduce GUC specifying whetherthis extra check<br /> is required to<br /> avoid a backup when failback.<br /><br /> Regards,<br /><br /> --<br />Fujii Masao<br /></blockquote></div><br /></div>
On Tue, Apr 9, 2013 at 9:42 AM, Samrat Revagade
<revagade.samrat@gmail.com> wrote:
>> What Samrat is proposing here is that WAL is not flushed to the OS
>> before it is acked by a synchronous replica, so recovery won't go
>> past the timeline change made in failover (going past it is what
>> makes a new base backup necessary to resync with the new master).
>
> Actually we are proposing that the data page on the master is not
> committed till the master receives an ACK from the standby. The WAL
> files can be flushed to the disk on both the master and the standby
> before the standby generates the ACK to the master. The end objective
> is the same: avoiding having to take a base backup of the old master
> to resync with the new master.

Sorry for misreading your e-mail. It seems like we are on the same page
here. I too have found this an annoying limitation when using
replication in an unreliable environment.

> Yes, taking a backup is a major problem when the database size is more
> than several TB. It would take a very long time to ship backup data
> over a slow WAN network.

For a WAN environment rsync can be a good enough answer; only a tiny
number of pages will actually be transferred. This is assuming a
smallish database and low bandwidth. For larger databases, avoiding the
need to read through the whole database for differences is an obvious
win.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
On 04/08/2013 12:34 PM, Samrat Revagade wrote:
>
> Hello,
>
> We have been trying to figure out possible solutions to the following
> problem in streaming replication. Consider the following scenario:
>
> When the master receives a commit command, it writes and flushes the
> commit WAL records to disk. It also writes and flushes the data page
> related to this transaction.

No data page flushing takes place. All data page writing is delayed to
bgwriter and/or checkpoints.

> The master then sends WAL records to the standby, up to the commit WAL
> record. But if failover happens before these records are sent, the old
> master is ahead of the standby (which is now the new master) in terms
> of DB data, leading to inconsistent data.

The master sends WAL records to the standby continuously, not "upon
commit wal record". In case of syncrep the master just waits for
confirmation from the standby before returning to the client on commit.

> One solution to avoid this situation is to have the master send WAL
> records to the standby and wait for an ACK from the standby committing
> the WAL files to disk, and only after that commit the data page
> related to this transaction on the master.

Not just commit, you must stop any *writing* of the WAL records,
effectively killing any parallelism.

> The main drawback would be increased wait time for the client due to
> the extra round trip to the standby before the master sends the ACK to
> the client. Are there any other issues with this approach?

The main issue is that it will make *all* backends dependent on each
sync commit, essentially serialising all backends' commits, with the
serialisation *including* the latency of the roundtrip. With current
sync streaming the other backends can continue to write WAL; with the
proposed approach you cannot write any records after the one waiting for
an ACK from the standby.
<div dir="ltr"><p class="">>>it's one of the reasons why a fresh base backup is required when starting old master asnew standby? >>If yes, I agree with you. I've often heard the complaints about a backup when restarting new standby.>>That's really big problem.<p class="">I think Fujii Masao is on the same page.<p class=""> <p class="">>Incase of syncrep the master just waits for confirmation from standby before returning to client on >commit.<pclass="">>Not just commit, you must stop any *writing* of the wal records effectively killing any parallelism.<br/> > Min issue is that it will make *all* backends dependant on each sync commit, essentially serialisingall >backends commits, with the serialisation *including* the latency of roundtrip to client. With current>sync streaming the other backends can continue to write wal, with proposed approach you cannot >write any recordsafter the one waiting an ACK from standby.<p class=""> <p class="">Let me rephrase the proposal in a more accuratemanner:<p class="">Consider following scenario:<p class=""> <p class="">(1) A client sends the "COMMIT" command tothe master server.<p class=""><p class="">(2) The master writes WAL record to disk<p class="">(3) The master writes thedata page related to this transaction. i.e. via checkpoint or bgwriter.<p class="">(4) The master sends WAL records continuouslyto the standby, up to the commit WAL record.<p class="">(5) The standby receives WAL records, writes them tothe disk, and then replies the ACK.<p class="">(6) The master returns a success indication to a client after it receivesACK.<p class=""> <p class="">If failover happens between (3) and (4), WAL and DB data in old master are ahead ofthem in new master. After failover, new master continues running new transactions independently from old master. Then WALrecord and DB data would become inconsistent between those two servers. To resolve these inconsistencies, the backup ofnew master needs to be taken onto new standby.<p class=""><br /><p class="">But taking backup is not feasible in case oflarger database size with several TB over a slow WAN.<br /><p class=""><p class="">So to avoid this type of inconsistencywithout taking fresh backup we are thinking to do following thing:<p class=""> <br /><p class="">>> Ithink that you can introduce GUC specifying whether this extra check is required to avoid a backup >>when failback.<pclass="">Approach:<p class="">Introduce new GUC option specifying whether to prevent PostgreSQL from writing DBdata before corresponding WAL records have been replicated to the standby. That is, if this GUC option is enabled, PostgreSQLwaits for corresponding WAL records to be not only written to the disk but also replicated to the standby beforewriting DB data.<p class=""><br /><p class="">So the process becomes as follows:<p class="">(1) A client sends the"COMMIT" command to the master server.<p class="">(2) The master writes the commit WAL record to the disk.<p class="">(3)The master sends WAL records continuously to standby up to the commit WAL record.<p class="">(4) The standbyreceives WAL records, writes them to disk, and then replies the ACK.<p class="">(5) <b>The master then forces a writeof the data page related to this transaction. </b><p class="">(6) The master returns a success indication to a clientafter it receives ACK.<p class=""> <p class="">While master is waiting to force a write (point 5) for this data page,streaming replication continuous. Also other data page writes are not dependent on this particular page write. 
So thecommit of data pages are not serialized.<p class="" style="style"><br /><p class="" style="style">Regards,<p class=""style="style">Samrat<p class=""><br /></div>
<div dir="ltr"><br />>(5) <b>The master then forces a write of the data page related to this transaction.</b><p class=""><b>Sorry,this is incorrect. Whenever the master writes the data page it checks that the WAL record is written instandby till that LSN. </b><p class=""><b><br /></b><p class=""><p class="">>While master is waiting to force a write(point 5) for this data page, streaming replication continuous. <p class="">>Also other data page writes are notdependent on this particular page write. So the commit of data >pages are not serialized.<p class=""><b>Sorry, thisis incorrect. Streaming replication continuous, master is not waiting, whenever the master writes the data page it checksthat the WAL record is written in standby till that LSN.</b><p class=""><b><br /></b><p class="" style="style"><b>Regards,</b><pclass="" style="style"><b>Samrat</b><p class="" style="style"><b><br /></b><p></div>
On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote:
>> (5) The master then forces a write of the data page related to this
>> transaction.
>
> Sorry, this is incorrect. Whenever the master writes the data page, it
> checks that the WAL record up to that LSN has been written on the
> standby.
>
>> While the master is waiting to force a write (point 5) for this data
>> page, streaming replication continues.
>> Also, other data page writes are not dependent on this particular
>> page write. So the commit of data pages is not serialized.
>
> Sorry, this is incorrect. Streaming replication continues and the
> master is not waiting; whenever the master writes the data page, it
> checks that the WAL record up to that LSN has been written on the
> standby.

I am not sure it will resolve the problem completely, as your old master
can have some extra WAL beyond the new master for the same timeline. I
don't remember exactly whether the timeline switch feature will take
care of this extra WAL; Heikki can confirm this point?

Also I think this can serialize the flush of data pages in
checkpoint/bgwriter, which is currently not the case.

With Regards,
Amit Kapila.
Amit Kapila <amit.kapila@huawei.com> writes:
> On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote:
>> Sorry, this is incorrect. Streaming replication continues and the
>> master is not waiting; whenever the master writes the data page, it
>> checks that the WAL record up to that LSN has been written on the
>> standby.

> I am not sure it will resolve the problem completely, as your old
> master can have some extra WAL beyond the new master for the same
> timeline. I don't remember exactly whether the timeline switch feature
> will take care of this extra WAL; Heikki can confirm this point?

> Also I think this can serialize the flush of data pages in
> checkpoint/bgwriter, which is currently not the case.

Yeah. TBH this entire discussion seems to be "let's cripple performance
in the normal case so that we can skip doing an rsync when resurrecting
a crashed, failed-over master". This is not merely optimizing for the
wrong thing, it's positively hazardous. After a fail-over, you should
be wondering whether it's safe to resurrect the old master at all, not
about how fast you can bring it back up without validating its data.
IOW, I wouldn't consider skipping the rsync even if I had a feature
like this.

			regards, tom lane
On 2013-04-10 10:10:31 -0400, Tom Lane wrote:
> Yeah. TBH this entire discussion seems to be "let's cripple
> performance in the normal case so that we can skip doing an rsync when
> resurrecting a crashed, failed-over master". This is not merely
> optimizing for the wrong thing, it's positively hazardous. After a
> fail-over, you should be wondering whether it's safe to resurrect the
> old master at all, not about how fast you can bring it back up without
> validating its data. IOW, I wouldn't consider skipping the rsync even
> if I had a feature like this.

Agreed. Especially as in situations where you fail over in a planned
way, e.g. for a hardware upgrade, you can avoid the need to resync with
a little bit of care. So it's mostly in catastrophic situations that
this becomes a problem, and in those you really should resync - and it's
a good idea not to use a normal rsync but an rsync --checksum or
similar.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 04/10/2013 09:10 AM, Tom Lane wrote:
> IOW, I wouldn't consider skipping the rsync even if I had a feature
> like this.

Totally. Out in the field, we consider the "old" database corrupt the
moment we fail over. There is literally no way to verify the safety of
any data along the broken chain, given race conditions and multiple
potential failure points.

The only potential use case for this that I can see would be for system
maintenance and a controlled failover. I agree: that's a major PITA when
doing DR testing, but I personally don't think this is the way to fix
that particular edge case. Maybe checksums will fix this in the long
run... I don't know. DRBD has a handy block-level verify function for
things like this, and it can re-sync master/slave data by comparing the
commit log across the servers if you tell it one node should be
considered incorrect.

The thing is... we have clogs, and we have WAL. If we can assume
bidirectional communication and verification (checksum comparison?) of
both of those components, the database *should* be able to re-sync
itself. Even if that were possible given the internals, I can't see
anyone jumping on this before 9.4 or 9.5 unless someone sponsors the
feature. Automatic re-sync (within available WALs) would be an awesome
feature, though...

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com
On Wed, Apr 10, 2013 at 11:26 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> On 04/10/2013 09:10 AM, Tom Lane wrote:
>
>> IOW, I wouldn't consider skipping the rsync even if I had a feature
>> like this.
>
> Totally. Out in the field, we consider the "old" database corrupt the
> moment we fail over.

Strange. If this is really true, the shared disk failover solution is
fundamentally broken, because the standby needs to start up with the
shared "corrupted" database at failover. Also, we cannot trust crash
recovery at all if we adopt that same logic. I think that there are
cases where we can replay and reuse the old database even after
PostgreSQL crashes.

Regards,

--
Fujii Masao
On 04/10/2013 11:40 AM, Fujii Masao wrote:
> Strange. If this is really true, the shared disk failover solution is
> fundamentally broken, because the standby needs to start up with the
> shared "corrupted" database at failover.

How so? Shared disk doesn't use replication. The point I was trying to
make is that replication requires synchronization between two disparate
servers, and verifying they have exactly the same data is a non-trivial
exercise. Even a single transaction after a failover (effectively)
negates the old server because there's no easy "catch up" mechanism yet.

Even if this isn't necessarily true, it's the safest approach IMO.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com
On Wed, Apr 10, 2013 at 11:16 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-04-10 10:10:31 -0400, Tom Lane wrote:
>> Yeah. TBH this entire discussion seems to be "let's cripple
>> performance in the normal case so that we can skip doing an rsync
>> when resurrecting a crashed, failed-over master". This is not merely
>> optimizing for the wrong thing, it's positively hazardous. After a
>> fail-over, you should be wondering whether it's safe to resurrect the
>> old master at all, not about how fast you can bring it back up
>> without validating its data. IOW, I wouldn't consider skipping the
>> rsync even if I had a feature like this.
>
> Agreed. Especially as in situations where you fail over in a planned
> way, e.g. for a hardware upgrade, you can avoid the need to resync
> with a little bit of care.

It's really worth documenting that way.

> So it's mostly in catastrophic situations that this becomes a problem,
> and in those you really should resync - and it's a good idea not to
> use a normal rsync but an rsync --checksum or similar.

If the database is very large, rsync --checksum takes very long. And I'm
concerned that most of the data pages on the master will have a
different checksum from those on the standby because of commit hint
bits. I'm not sure how rsync --checksum can speed up the backup after
failover.

Regards,

--
Fujii Masao
On Thu, Apr 11, 2013 at 1:44 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> On 04/10/2013 11:40 AM, Fujii Masao wrote:
>
>> Strange. If this is really true, the shared disk failover solution is
>> fundamentally broken, because the standby needs to start up with the
>> shared "corrupted" database at failover.
>
> How so? Shared disk doesn't use replication. The point I was trying to
> make is that replication requires synchronization between two
> disparate servers, and verifying they have exactly the same data is a
> non-trivial exercise. Even a single transaction after a failover
> (effectively) negates the old server because there's no easy "catch
> up" mechanism yet.

Hmm... ISTM what Samrat is proposing can resolve the problem. That is,
if we can be sure that any data page which has not been replicated to
the standby is not written on the master, the new standby (i.e., the old
master) can safely catch up with the new master (i.e., the old standby).
In this approach, of course, the new standby might have some WAL records
which the new master doesn't have, so before starting up the new
standby, we need to remove all the WAL files on the new standby and
retrieve any WAL files from the new master. But what's the problem with
his approach?

Regards,

--
Fujii Masao
On Wed, Apr 10, 2013 at 7:44 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> On 04/10/2013 11:40 AM, Fujii Masao wrote:
>
>> Strange. If this is really true, the shared disk failover solution is
>> fundamentally broken, because the standby needs to start up with the
>> shared "corrupted" database at failover.
>
> How so? Shared disk doesn't use replication. The point I was trying to
> make is that replication requires synchronization between two
> disparate servers, and verifying they have exactly the same data is a
> non-trivial exercise. Even a single transaction after a failover
> (effectively) negates the old server because there's no easy "catch
> up" mechanism yet.
>
> Even if this isn't necessarily true, it's the safest approach IMO.

We already rely on WAL-before-data to ensure correct recovery. What is
proposed here is to slightly redefine it to require WAL to be replicated
before it is considered to be flushed. This ensures that no data page on
disk differs from the WAL that the slave has. The machinery to do this
is already mostly there: we already wait for WAL flushes, and we know
the write location on the slave. The second requirement is that we never
start up as master and we don't trust any local WAL. This is actually
how pacemaker clusters work; you would only need to amend the RA to wipe
the WAL and configure postgresql with restart_after_crash = false.

It would be very helpful in restoring HA capability after failover if we
wouldn't have to read through the whole database after a VM goes down
and is migrated with the shared disk onto a new host.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
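For reference, the PostgreSQL side of that setup is just the existing
GUC; the RA change is on the cluster-manager side:

    # postgresql.conf on a pacemaker-managed node, per the suggestion above
    restart_after_crash = off   # don't self-recover after a crash; let the
                                # cluster manager decide this node's role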
Ants Aasma <ants@cybertec.at> writes:
> We already rely on WAL-before-data to ensure correct recovery. What is
> proposed here is to slightly redefine it to require WAL to be
> replicated before it is considered to be flushed. This ensures that no
> data page on disk differs from the WAL that the slave has. The
> machinery to do this is already mostly there: we already wait for WAL
> flushes, and we know the write location on the slave. The second
> requirement is that we never start up as master and we don't trust any
> local WAL. This is actually how pacemaker clusters work; you would
> only need to amend the RA to wipe the WAL and configure postgresql
> with restart_after_crash = false.

> It would be very helpful in restoring HA capability after failover if
> we wouldn't have to read through the whole database after a VM goes
> down and is migrated with the shared disk onto a new host.

The problem with this is it's making an idealistic assumption that a
crashed master didn't do anything wrong or lose/corrupt any data during
its crash. As soon as you realize that's an unsafe assumption, the
whole thing becomes worthless to you.

If the idea had zero implementation cost, I would say "sure, let people
play with it until they find out (probably the hard way) that it's a bad
idea". But it's going to introduce, at the very least, additional
complexity into a portion of the system that is critical and plenty
complicated enough already. That being the case, I don't want it there
at all, not even as an option.

			regards, tom lane
On 2013-04-10 18:46, Fujii Masao wrote:
> On Wed, Apr 10, 2013 at 11:16 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> So it's mostly in catastrophic situations that this becomes a
>> problem, and in those you really should resync - and it's a good idea
>> not to use a normal rsync but an rsync --checksum or similar.
>
> If the database is very large, rsync --checksum takes very long. And
> I'm concerned that most of the data pages on the master will have a
> different checksum from those on the standby because of commit hint
> bits. I'm not sure how rsync --checksum can speed up the backup after
> failover.

"rsync --checksum" alone may not, but "rsync --inplace" may speed up the
backup a lot.

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
     http://www.postgresql.at/
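For example, something along these lines (paths illustrative; the flags
are standard rsync options):

    rsync --archive --checksum --inplace --delete \
          /var/lib/postgresql/data/ standby:/var/lib/postgresql/data/

--checksum compares file contents rather than trusting size and mtime,
and --inplace updates the destination files directly instead of
rebuilding each one in a temporary copy.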
On 2013-04-10 20:39:25 +0200, Boszormenyi Zoltan wrote:
> On 2013-04-10 18:46, Fujii Masao wrote:
>> If the database is very large, rsync --checksum takes very long. And
>> I'm concerned that most of the data pages on the master will have a
>> different checksum from those on the standby because of commit hint
>> bits. I'm not sure how rsync --checksum can speed up the backup after
>> failover.

It's not about speed, it's about correctness.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wednesday, April 10, 2013 10:31 PM Fujii Masao wrote:
> On Thu, Apr 11, 2013 at 1:44 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:
>> How so? Shared disk doesn't use replication. The point I was trying
>> to make is that replication requires synchronization between two
>> disparate servers, and verifying they have exactly the same data is a
>> non-trivial exercise. Even a single transaction after a failover
>> (effectively) negates the old server because there's no easy "catch
>> up" mechanism yet.
>
> Hmm... ISTM what Samrat is proposing can resolve the problem. That is,
> if we can be sure that any data page which has not been replicated to
> the standby is not written on the master, the new standby (i.e., the
> old master) can safely catch up with the new master (i.e., the old
> standby). In this approach, of course, the new standby might have some
> WAL records which the new master doesn't have, so before starting up
> the new standby, we need to remove all the WAL files on the new
> standby and retrieve any WAL files from the new master. But what's the
> problem with his approach?

Consider the case where the old master crashed while flushing a data
page; now you would need the full page image from the new master. It
might so happen that on the new master a checkpoint would have purged
(reused) the log files from that timeline, in which case it will be
difficult to get the full page image. The user can refer to the WAL
archive for that, but I think it will not be straightforward.

One more point: what will the new behavior be when there are two
transactions, one with synchronous_commit = off and the other with it
on?

With Regards,
Amit Kapila.
On Thu, Apr 11, 2013 at 10:09 AM, Amit Kapila <amit.kapila@huawei.com> wrote:
> Consider the case where the old master crashed while flushing a data
> page; now you would need the full page image from the new master. It
> might so happen that on the new master a checkpoint would have purged
> (reused) the log files from that timeline, in which case it will be
> difficult to get the full page image. The user can refer to the WAL
> archive for that, but I think it will not be straightforward.

Using a WAL archive is standard procedure when you do not wish to take
new base backups all the time. This already works now: when archive
recovery is set up, the WAL will be automatically fetched from the
archive.

> One more point: what will the new behavior be when there are two
> transactions, one with synchronous_commit = off and the other with it
> on?

Exactly the same as now - waiting for commit record replication when
reporting commit success to the client and waiting for WAL replication
before writing a data page are orthogonal features.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
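The archive fetch referred to above is just the stock archive-recovery
setup, e.g. in recovery.conf on the node being resynced (paths and
conninfo are illustrative):

    standby_mode = 'on'
    primary_conninfo = 'host=new-master port=5432 user=replication'
    restore_command = 'cp /mnt/wal_archive/%f "%p"'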
Hello,

> The only potential use case for this that I can see would be for
> system maintenance and a controlled failover. I agree: that's a major
> PITA when doing DR testing, but I personally don't think this is the
> way to fix that particular edge case.

This is the use case we are trying to address (I work with Samrat). We
were wondering why this may not be a good fix?

regards
Sameer

On Thu, Apr 11, 2013 at 3:52 PM, Ants Aasma <ants@cybertec.at> wrote:
> On Thu, Apr 11, 2013 at 10:09 AM, Amit Kapila <amit.kapila@huawei.com> wrote:
>> Consider the case where the old master crashed while flushing a data
>> page; now you would need the full page image from the new master. It
>> might so happen that on the new master a checkpoint would have purged
>> (reused) the log files from that timeline, in which case it will be
>> difficult to get the full page image. The user can refer to the WAL
>> archive for that, but I think it will not be straightforward.
>
> Using a WAL archive is standard procedure when you do not wish to take
> new base backups all the time. This already works now: when archive
> recovery is set up, the WAL will be automatically fetched from the
> archive.
>
>> One more point: what will the new behavior be when there are two
>> transactions, one with synchronous_commit = off and the other with it
>> on?
>
> Exactly the same as now - waiting for commit record replication when
> reporting commit success to the client and waiting for WAL replication
> before writing a data page are orthogonal features.
>
> Regards,
> Ants Aasma
> --
> Cybertec Schönig & Schönig GmbH
> Gröhrmühlgasse 26
> A-2700 Wiener Neustadt
> Web: http://www.postgresql-support.de
On 04/11/2013 01:26 PM, Sameer Thakur wrote:
> Hello,
>
>>> The only potential use case for this that I can see would be for
>>> system maintenance and a controlled failover. I agree: that's a
>>> major PITA when doing DR testing, but I personally don't think this
>>> is the way to fix that particular edge case.
>
> This is the use case we are trying to address (I work with Samrat).

You just shut down the old master and let the standby catch up (takes a
few microseconds ;) ) before you promote it.

After this you can start up the former master with recovery.conf and it
will follow nicely.

> We were wondering why this may not be a good fix?

The proposed fix - halting all writes of data pages to disk and to WAL
files while waiting for an ACK from the standby - will tremendously slow
down all parallel work on the master.

And it does just turn the "master is ahead of slave" problem around into
a "slave is ahead of master" problem :)

Regards
Hannu
On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
> The proposed fix - halting all writes of data pages to disk and to WAL
> files while waiting for an ACK from the standby - will tremendously
> slow down all parallel work on the master.

This is not what is being proposed. The proposed fix halts writes of
only data pages that are modified within the window of WAL that is not
yet ACKed by the slave. This means pages that were recently modified and
where the clocksweep or checkpoint has decided to evict them. This only
affects the checkpointer, bgwriter and backends doing allocation.
Furthermore, for the backend clocksweep case it would be reasonable to
just pick another buffer to evict. The slowdown for most actual cases
will be negligible.

> And it does just turn the "master is ahead of slave" problem around
> into a "slave is ahead of master" problem :)

The issue is not being ahead or behind. The issue is ensuring WAL
durability in the face of failovers before modifying data pages. This is
sufficient to guarantee no forks in the WAL stream from the point of
view of data files, and with that the capability to always recover by
replaying WAL. There can still be forks from the point of view of async
commits, with the most recent commits disappearing on failover, but this
is in no way different from what we have now.

I don't share the view that the disk image is extremely likely to be
corrupt after a crash. If that were the case then we should recommend
that people don't use crash recovery at all and always restore from a
backup. For errors like power supply failure, uncorrectable ECC errors,
etc. we can be pretty sure that the server was not writing garbage into
the storage system before failing. Having to do a day-long rsync run
plus recovery to catch up on all changes made during the resync, just to
restore high-availability safety in those circumstances, is in many
cases the larger risk.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
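A sketch of the "pick another buffer" rule mentioned above, in
illustrative C (hypothetical helper; the real clocksweep in freelist.c
is more involved):

    /*
     * Hypothetical eviction test: a dirty buffer is a candidate for
     * eviction only if the WAL covering its last change has been ACKed
     * by the sync standby; otherwise the clocksweep should move on to
     * another buffer rather than block.  XLogRecPtr is the 64-bit WAL
     * position type (as in xlogdefs.h).
     */
    static bool
    buffer_evictable(XLogRecPtr page_lsn, XLogRecPtr standby_ack_lsn)
    {
        return page_lsn <= standby_ack_lsn;
    }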
Ants Aasma <ants@cybertec.at> writes:
> On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
>> The proposed fix - halting all writes of data pages to disk and to
>> WAL files while waiting for an ACK from the standby - will
>> tremendously slow down all parallel work on the master.

> This is not what is being proposed.

The claim that it won't slow things down is lacking in credible
evidence. Moreover, whether there's any performance hit or not, it will
certainly add complexity (and risk of bugs) in a place where we can ill
afford more. And the bottom line remains: this is optimizing for the
wrong thing.

			regards, tom lane
On 04/11/2013 03:52 PM, Ants Aasma wrote:
> On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
>> The proposed fix - halting all writes of data pages to disk and to
>> WAL files while waiting for an ACK from the standby - will
>> tremendously slow down all parallel work on the master.
>
> This is not what is being proposed. The proposed fix halts writes of
> only data pages that are modified within the window of WAL that is not
> yet ACKed by the slave. This means pages that were recently modified
> and where the clocksweep or checkpoint has decided to evict them. This
> only affects the checkpointer, bgwriter and backends doing allocation.
> Furthermore, for the backend clocksweep case it would be reasonable to
> just pick another buffer to evict. The slowdown for most actual cases
> will be negligible.

You also need to hold back all WAL writes, including the ones by
parallel async and locally-synced transactions. Which means that you
have to make all locally synced transactions wait on the syncrep
transactions committed before them.

After getting the ACK from the slave you then have a backlog of stuff to
write locally, which then also needs to be sent to the slave. Basically
this turns a nice smooth WAL write-and-stream pipeline into a chunky
wait-and-write-and-wait-and-stream-and-wait :P

This may not be a problem in light write load cases, which is probably
the most common usecase for postgres, but it will harm top performance
and also force people to get much better (and more expensive) hardware
than would otherwise be needed.

>> And it does just turn the "master is ahead of slave" problem around
>> into a "slave is ahead of master" problem :)
>
> The issue is not being ahead or behind. The issue is ensuring WAL
> durability in the face of failovers before modifying data pages. This
> is sufficient to guarantee no forks in the WAL stream from the point
> of view of data files, and with that the capability to always recover
> by replaying WAL.

How would this handle the case Tom pointed out, namely a short power
recycling on the master?

Instead of just continuing after booting up again, the master now has to
figure out if it had any slaves and then try to query them (for how
long?) to see if they had replayed any WAL the master does not know of.

Suddenly the pure existence of streaming replica slaves has become a
problem for the master!

This will especially complicate the case of multiple slaves, each having
received WAL to a slightly different LSN. And you do want to have at
least 2 slaves if you want both durability and availability with
syncrep.

What if one of the slaves disconnects? How should the master react to
this?

Regards
Hannu Krosing
On Thu, Apr 11, 2013 at 5:33 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote: > On 04/11/2013 03:52 PM, Ants Aasma wrote: >> >> On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing <hannu@2ndquadrant.com> >> wrote: >>> >>> The proposed fix - halting all writes of data pages to disk and >>> to WAL files while waiting ACK from standby - will tremendously >>> slow down all parallel work on master. >> >> This is not what is being proposed. The proposed fix halts writes of >> only data pages that are modified within the window of WAL that is not >> yet ACKed by the slave. This means pages that were recently modified >> and where the clocksweep or checkpoint has decided to evict them. This >> only affects the checkpointer, bgwriter and backends doing allocation. >> Furthermore, for the backend clocksweep case it would be reasonable to >> just pick another buffer to evict. The slowdown for most actual cases >> will be negligible. > > You also need to hold back all WAL writes, including the ones by > parallel async and locally-synced transactions. Which means that > you have to make all locally synced transactions to wait on the > syncrep transactions committed before them. > After getting the ACK from slave you then have a backlog of stuff > to write locally, which then also needs to be sent to slave. Basically > this turns a nice smooth WAL write-and-stream pipeline into a > chunky wait-and-write-and-wait-and-stream-and-wait :P > This may not be a problem in slight write load cases, which is > probably the most widely happening usecase for postgres, but it > will harm top performance and also force people to get much > better (and more expensive) hardware than would otherways > be needed. Why would you need to hold back WAL writes? WAL is written on master first and then steamed to slave as it is done now. You would only need hold back dirty page evictions having a recent enough LSN to not yet be replicated. This holding back is already done to wait for local WAL flushes, see bufmgr.c:1976 and bufmgr.c:669. When a page gets dirtied it's usage count gets bumped, so it will not be considered for eviction for at least one clocksweep cycle. In normal circumstances that will be enough time to get an ACK from the slave. When WAL is generated at an higher rate than can be replicated this will not be true. In that case backends that need to bring in new pages will have to wait for WAL to be replicated before they can continue. That will hopefully include the backends that are doing the dirtying, throttling the WAL generation rate. This would definitely be optional behavior, not something turned on by default. >> >>> And it does just turn around "master is ahead of slave" problem >>> into "slave is ahead of master" problem :) >> >> The issue is not being ahead or behind. The issue is ensuring WAL >> durability in the face of failovers before modifying data pages. This >> is sufficient to guarantee no forks in the WAL stream from the point >> of view of data files and with that the capability to always recover >> by replaying WAL. > > How would this handle the case Tom pointed out, namely a short > power recycling on master ? > > Instead of just continuing after booting up again the master now > has to figure out if it had any slaves and then try to query them > (for how long?) if they had any replayed WAL the master does > not know of. If the master is restarted and there is no failover to the slave, then nothing strange would happen, master does recovery, comes up and starts streaming to the slave again. 
If there is a failover, then whatever is managing the failover needs to ensure that the master does not come up again on its own before it is reconfigured as a slave. This is what HA cluster managers do. > Suddenly the pure existence of streaming replica slaves has become > a problem for the master! > > This will especially complicate the case of multiple slaves, each > having received WAL to a slightly different LSN. And you do want > to have at least 2 slaves if you want both durability > and availability with syncrep. > > What if one of the slaves disconnects? How should the master react to this? Again, WAL replication will be the same as it is now. Availability considerations, including what to do when slaves go away, are the same as for current sync replication. The only required change is that we can configure the master to hold out on writing any data pages that contain changes that might go missing in the case of a failover. Whether the additional complexity is worth the feature is a matter of opinion. As we have no patch yet I can't say that I know what all the implications are, but at first glance the complexity seems rather compartmentalized. This would only amend what the concept of a WAL flush considers safely flushed. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
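To make the mechanism Ants describes concrete, here is a minimal sketch of how the buffer-flush path could be amended, analogous to what SyncRepWaitForLSN() already does for commits. SyncRepAckedUpTo() and WaitForSyncRepAck() are assumed helper names for illustration, not actual PostgreSQL functions:

    /*
     * Sketch only, not actual PostgreSQL source: extend the existing
     * WAL-before-data rule so that a dirty buffer may be written out only
     * once the WAL covering it is flushed locally *and* acknowledged by
     * the synchronous standby.
     */
    static void
    FlushBufferSketch(BufferDesc *buf)
    {
        XLogRecPtr  recptr = BufferGetLSN(buf);

        /* Existing rule: the WAL covering this page must be on local disk. */
        XLogFlush(recptr);

        /*
         * Proposed addition: the same WAL must also be ACKed by the sync
         * standby, so a failover can never leave data pages on the old
         * master ahead of the surviving WAL stream.
         */
        while (SyncRepAckedUpTo() < recptr)      /* assumed helper */
            WaitForSyncRepAck(recptr);           /* assumed wait primitive */

        /* ... the existing smgrwrite() of the page would follow here ... */
    }

Under Ants' argument, by the time a page's usage count has decayed enough for eviction the ACK would normally already have arrived, so only a lagging standby would make this loop block.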
On Thu, Apr 11, 2013 at 2:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Ants Aasma <ants@cybertec.at> writes: >> We already rely on WAL-before-data to ensure correct recovery. What is >> proposed here is to slightly redefine it to require WAL to be >> replicated before it is considered to be flushed. This ensures that no >> data page on disk differs from the WAL that the slave has. The >> machinery to do this is already mostly there, we already wait for WAL >> flushes and we know the write location on the slave. The second >> requirement is that we never start up as master and we don't trust any >> local WAL. This is actually how pacemaker clusters work, you would >> only need to amend the RA to wipe the WAL and configure postgresql >> with restart_after_crash = false. > >> It would be very helpful in restoring HA capability after failover if >> we wouldn't have to read through the whole database after a VM goes >> down and is migrated with the shared disk onto a new host. > > The problem with this is it's making an idealistic assumption that a > crashed master didn't do anything wrong or lose/corrupt any data during > its crash. As soon as you realize that's an unsafe assumption, the > whole thing becomes worthless to you. Crash recovery relies on the same assumption. If it's really unsafe, we should stop supporting crash recovery. But I don't think that's always true. Regards, -- Fujii Masao
On Thu, Apr 11, 2013 at 10:25 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote: > > You just shut down the old master and let the standby catch > up (takes a few microseconds ;) ) before you promote it. > > After this you can start up the former master with recovery.conf > and it will follow nicely. No. When you shut down the old master, it might not have been able to send all the WAL records to the standby. I have observed this situation several times. So in your approach, the new standby might fail to catch up with the master nicely. Regards, -- Fujii Masao
On Fri, Apr 12, 2013 at 12:09 AM, Ants Aasma <ants@cybertec.at> wrote: > On Thu, Apr 11, 2013 at 5:33 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote: >> On 04/11/2013 03:52 PM, Ants Aasma wrote: >>> >>> On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing <hannu@2ndquadrant.com> >>> wrote: >>>> >>>> The proposed fix - halting all writes of data pages to disk and >>>> to WAL files while waiting ACK from standby - will tremendously >>>> slow down all parallel work on master. >>> >>> This is not what is being proposed. The proposed fix halts writes of >>> only data pages that are modified within the window of WAL that is not >>> yet ACKed by the slave. This means pages that were recently modified >>> and where the clocksweep or checkpoint has decided to evict them. This >>> only affects the checkpointer, bgwriter and backends doing allocation. >>> Furthermore, for the backend clocksweep case it would be reasonable to >>> just pick another buffer to evict. The slowdown for most actual cases >>> will be negligible. >> >> You also need to hold back all WAL writes, including the ones by >> parallel async and locally-synced transactions. Which means that >> you have to make all locally synced transactions wait on the >> syncrep transactions committed before them. >> After getting the ACK from the slave you then have a backlog of stuff >> to write locally, which then also needs to be sent to the slave. Basically >> this turns a nice smooth WAL write-and-stream pipeline into a >> chunky wait-and-write-and-wait-and-stream-and-wait :P >> This may not be a problem under light write loads, which is >> probably the most common use case for postgres, but it >> will harm top performance and also force people to get much >> better (and more expensive) hardware than would otherwise >> be needed. > > Why would you need to hold back WAL writes? WAL is written on the master > first and then streamed to the slave as it is done now. You would only need > to hold back dirty page evictions having a recent enough LSN to not yet > be replicated. This holding back is already done to wait for local WAL > flushes, see bufmgr.c:1976 and bufmgr.c:669. When a page gets dirtied > its usage count gets bumped, so it will not be considered for > eviction for at least one clocksweep cycle. In normal circumstances > that will be enough time to get an ACK from the slave. When WAL is > generated at a higher rate than can be replicated this will not be > true. In that case backends that need to bring in new pages will have > to wait for WAL to be replicated before they can continue. That will > hopefully include the backends that are doing the dirtying, throttling > the WAL generation rate. This would definitely be optional behavior, > not something turned on by default. > >>> >>>> And it does just turn around "master is ahead of slave" problem >>>> into "slave is ahead of master" problem :) >>> >>> The issue is not being ahead or behind. The issue is ensuring WAL >>> durability in the face of failovers before modifying data pages. This >>> is sufficient to guarantee no forks in the WAL stream from the point >>> of view of data files and with that the capability to always recover >>> by replaying WAL. >> >> How would this handle the case Tom pointed out, namely a short >> power cycling on the master? >> >> Instead of just continuing after booting up again, the master now >> has to figure out if it had any slaves and then try to query them >> (for how long?) to find out whether they have any replayed WAL the master does >> not know of.
> > If the master is restarted and there is no failover to the slave, then > nothing strange would happen, master does recovery, comes up and > starts streaming to the slave again. If there is a failover, then > whatever is managing the failover needs to ensure that the master does > not come up again on its own before it is reconfigured as a slave. > This is what HA cluster managers do. > >> Suddenly the pure existence of streaming replica slaves has become >> a problem for the master! >> >> This will especially complicate the case of multiple slaves, each >> having received WAL to a slightly different LSN. And you do want >> to have at least 2 slaves if you want both durability >> and availability with syncrep. >> >> What if one of the slaves disconnects? How should the master react to this? > > Again, WAL replication will be the same as it is now. Availability > considerations, including what to do when slaves go away, are the same > as for current sync replication. The only required change is that we can > configure the master to hold out on writing any data pages that > contain changes that might go missing in the case of a failover. > > Whether the additional complexity is worth the feature is a matter of > opinion. As we have no patch yet I can't say that I know what all the > implications are, but at first glance the complexity seems rather > compartmentalized. This would only amend what the concept of a WAL > flush considers safely flushed. I really share the same view! Regards, -- Fujii Masao
On Thu, Apr 11, 2013 at 8:39 PM, Ants Aasma <ants@cybertec.at> wrote:
On Thu, Apr 11, 2013 at 5:33 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
> On 04/11/2013 03:52 PM, Ants Aasma wrote:
>>
>> On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing <hannu@2ndquadrant.com>
>> wrote:
>>>
>>> The proposed fix - halting all writes of data pages to disk and
>>> to WAL files while waiting ACK from standby - will tremendously
>>> slow down all parallel work on master.
>>
>> This is not what is being proposed. The proposed fix halts writes of
>> only data pages that are modified within the window of WAL that is not
>> yet ACKed by the slave. This means pages that were recently modified
>> and where the clocksweep or checkpoint has decided to evict them. This
>> only affects the checkpointer, bgwriter and backends doing allocation.
>> Furthermore, for the backend clocksweep case it would be reasonable to
>> just pick another buffer to evict. The slowdown for most actual cases
>> will be negligible.
>
> You also need to hold back all WAL writes, including the ones by
> parallel async and locally-synced transactions. Which means that
> you have to make all locally synced transactions wait on the
> syncrep transactions committed before them.
> After getting the ACK from the slave you then have a backlog of stuff
> to write locally, which then also needs to be sent to the slave. Basically
> this turns a nice smooth WAL write-and-stream pipeline into a
> chunky wait-and-write-and-wait-and-stream-and-wait :P
> This may not be a problem under light write loads, which is
> probably the most common use case for postgres, but it
> will harm top performance and also force people to get much
> better (and more expensive) hardware than would otherwise
> be needed.
Why would you need to hold back WAL writes? WAL is written on the master
first and then streamed to the slave as it is done now. You would only need
to hold back dirty page evictions having a recent enough LSN to not yet
be replicated. This holding back is already done to wait for local WAL
flushes, see bufmgr.c:1976 and bufmgr.c:669. When a page gets dirtied
its usage count gets bumped, so it will not be considered for
eviction for at least one clocksweep cycle. In normal circumstances
that will be enough time to get an ACK from the slave. When WAL is
generated at a higher rate than can be replicated this will not be
true. In that case backends that need to bring in new pages will have
to wait for WAL to be replicated before they can continue. That will
hopefully include the backends that are doing the dirtying, throttling
the WAL generation rate. This would definitely be optional behavior,
not something turned on by default.
I agree. I don't think the proposed change would cause much of a performance bottleneck, since the proposal is to hold back writing of dirty pages until the WAL is replicated successfully to the standby. The heap pages are mostly written by the background processes, often much later than the WAL for the change is written. So in all likelihood, there will be no wait involved. Of course, this will not be true for very frequently updated pages that must be written at a checkpoint.
But I wonder if the problem is really limited to the heap pages? Even for something like a CLOG page, we will need to ensure that the WAL records are replicated before the page is written to disk. The same is true for relation truncation. In fact, all places where the master needs to call XLogFlush() probably need to be examined to decide if the subsequent action has a chance to leave the database corrupt, and to ensure that the WAL is replicated before proceeding with the change.
Tom has a very valid concern from the additional-code-complexity point of view, though I disagree that it's always a good idea to start with a fresh rsync. If we can avoid that with the right checks, I don't see why we should not reduce the downtime for the master. It's very likely that the standby may not be as good a server as the master is, and the user would want to quickly switch back to the master for performance reasons. To reduce complexity, can we do this as some sort of plugin for XLogFlush() which gets to know that XLogFlush() has flushed up to the given LSN, and the event that caused the function to be called? We can then leave the handling of the event to the implementer. This will also avoid any penalty for those who are happy with the current mechanism and do not want any complex HA setups.
Thanks,
Pavan
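One way to picture the XLogFlush() plugin Pavan suggests is a hook fired once a flush completes, in the style of PostgreSQL's existing *_hook extension points. The names below are invented for illustration and are not an actual PostgreSQL API:

    /* Sketch: why XLogFlush() was called, passed through to the plugin. */
    typedef enum XLogFlushReason
    {
        XLOG_FLUSH_COMMIT,          /* transaction commit */
        XLOG_FLUSH_BUFFER_WRITE,    /* about to write a dirty data page */
        XLOG_FLUSH_CHECKPOINT       /* checkpoint */
    } XLogFlushReason;

    /* Called after XLogFlush() has durably flushed WAL up to 'upto'. */
    typedef void (*xlog_flush_hook_type) (XLogRecPtr upto,
                                          XLogFlushReason reason);

    xlog_flush_hook_type xlog_flush_hook = NULL;

    /* ... at the end of XLogFlush(record), after the local flush succeeds,
     * with 'reason' supplied by the caller: */
    if (xlog_flush_hook != NULL)
        xlog_flush_hook(record, reason);

A plugin could then block inside the hook until the synchronous standby has acknowledged 'upto', so installations that do not load such a plugin pay no penalty, which is exactly the property Pavan is after.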
On 04/11/2013 07:29 PM, Fujii Masao wrote: > On Thu, Apr 11, 2013 at 10:25 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote: >> You just shut down the old master and let the standby catch >> up (takes a few microseconds ;) ) before you promote it. >> >> After this you can start up the former master with recovery.conf >> and it will follow nicely. > No. When you shut down the old master, it might not have been > able to send all the WAL records to the standby. In what cases (other than a standby lagging too much or not listening at all) have you observed this? > I have observed > this situation several times. So in your approach, the new standby > might fail to catch up with the master nicely. The page http://wiki.postgresql.org/wiki/Streaming_Replication claims this: * Graceful shutdown When smart/fast shutdown is requested, the primary waits to exit until XLOG records have been sent to the standby, up to the shutdown checkpoint record. Maybe you were requesting immediate shutdown? Regards Hannu Krosing
On 2013-04-12 02:29:01 +0900, Fujii Masao wrote: > On Thu, Apr 11, 2013 at 10:25 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote: > > > > You just shut down the old master and let the standby catch > > up (takes a few microseconds ;) ) before you promote it. > > > > After this you can start up the former master with recovery.conf > > and it will follow nicely. > > No. When you shut down the old master, it might not have been > able to send all the WAL records to the standby. I have observed > this situation several times. So in your approach, the new standby > might fail to catch up with the master nicely. It seems most of this thread is focusing on the wrong thing then. If we really are only talking about planned failover then we need to solve *that*, not some ominous "don't flush data too early" which has noticeable performance and implementation complexity problems. I guess you're observing that not everything is replicated because you're doing an immediate shutdown - probably because performing the shutdown checkpoint would take too long. This seems solvable by implementing a recovery connection command which initiates a shutdown that just disables future WAL inserts and returns the last LSN that has been written. Then you can fail over as soon as that LSN has been reached and can make the previous master follow from there on without problems. You could even teach the standby not to increment the timeline in that case since that's safe. The biggest issue seems to be how to implement this without another spinlock acquisition for every XLogInsert(), but that seems possible. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
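A rough sketch of the walsender-side primitive Andres describes might look like the function below. GetXLogInsertRecPtr() does exist; walInsertsDisabled is an assumed flag that XLogInsert() would have to test, and in 9.2-era code the single WALInsertLock could serialize setting it:

    /*
     * Sketch only: refuse all future WAL inserts and report the end of
     * WAL, so the standby can be promoted exactly at this LSN and the
     * old master can later rejoin as a standby without a new base backup.
     */
    static XLogRecPtr
    DisableWALInsertsAndGetEndOfWAL(void)
    {
        XLogRecPtr  endptr;

        LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
        walInsertsDisabled = true;          /* assumed flag, checked in XLogInsert() */
        endptr = GetXLogInsertRecPtr();     /* current end of generated WAL */
        LWLockRelease(WALInsertLock);

        return endptr;
    }

The cluster manager would then promote the standby only once its received LSN reaches the returned value, which is the "fail over as soon as that LSN has been reached" step in Andres' outline.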
On 2013-04-12 11:18:01 +0530, Pavan Deolasee wrote: > On Thu, Apr 11, 2013 at 8:39 PM, Ants Aasma <ants@cybertec.at> wrote: > > > On Thu, Apr 11, 2013 at 5:33 PM, Hannu Krosing <hannu@2ndquadrant.com> > > wrote: > > > On 04/11/2013 03:52 PM, Ants Aasma wrote: > > >> > > >> On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing <hannu@2ndquadrant.com> > > >> wrote: > > >>> > > >>> The proposed fix - halting all writes of data pages to disk and > > >>> to WAL files while waiting ACK from standby - will tremendously > > >>> slow down all parallel work on master. > > >> > > >> This is not what is being proposed. The proposed fix halts writes of > > >> only data pages that are modified within the window of WAL that is not > > >> yet ACKed by the slave. This means pages that were recently modified > > >> and where the clocksweep or checkpoint has decided to evict them. This > > >> only affects the checkpointer, bgwriter and backends doing allocation. > > >> Furthermore, for the backend clocksweep case it would be reasonable to > > >> just pick another buffer to evict. The slowdown for most actual cases > > >> will be negligible. > > > > > > You also need to hold back all WAL writes, including the ones by > > > parallel async and locally-synced transactions. Which means that > > > you have to make all locally synced transactions wait on the > > > syncrep transactions committed before them. > > > After getting the ACK from the slave you then have a backlog of stuff > > > to write locally, which then also needs to be sent to the slave. Basically > > > this turns a nice smooth WAL write-and-stream pipeline into a > > > chunky wait-and-write-and-wait-and-stream-and-wait :P > > > This may not be a problem under light write loads, which is > > > probably the most common use case for postgres, but it > > > will harm top performance and also force people to get much > > > better (and more expensive) hardware than would otherwise > > > be needed. > > > > Why would you need to hold back WAL writes? WAL is written on the master > > first and then streamed to the slave as it is done now. You would only need > > to hold back dirty page evictions having a recent enough LSN to not yet > > be replicated. This holding back is already done to wait for local WAL > > flushes, see bufmgr.c:1976 and bufmgr.c:669. When a page gets dirtied > > its usage count gets bumped, so it will not be considered for > > eviction for at least one clocksweep cycle. In normal circumstances > > that will be enough time to get an ACK from the slave. When WAL is > > generated at a higher rate than can be replicated this will not be > > true. In that case backends that need to bring in new pages will have > > to wait for WAL to be replicated before they can continue. That will > > hopefully include the backends that are doing the dirtying, throttling > > the WAL generation rate. This would definitely be optional behavior, > > not something turned on by default. > > > > > I agree. I don't think the proposed change would cause much of a performance > bottleneck, since the proposal is to hold back writing of dirty pages until > the WAL is replicated successfully to the standby. The heap pages are > mostly written by the background processes, often much later than the WAL for > the change is written. So in all likelihood, there will be no wait > involved. Of course, this will not be true for very frequently updated > pages that must be written at a checkpoint. I don't think that holds true at all.
If you look at pg_stat_bgwriter in any remotely busy cluster with a hot data set over shared_buffers you'll notice that a large percentage of writes will have been done by backends themselves. Yes, we need to improve on this, and we are talking about it right now in another thread, but until that's solved this argumentation seems to fall flat on its face. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Apr 12, 2013 at 4:29 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> I don't think that holds true at all. If you look at pg_stat_bgwriter in
> any remotely busy cluster with a hot data set over shared_buffers you'll
> notice that a large percentage of writes will have been done by backends
> themselves.
Even if what you are saying is true, which I am sure it is, the pages that the backend is evicting must not have been recently used according to the LRU algorithm, which means that the WAL pertaining to the last change to the page has in all likelihood already been replicated, unless the replication is lagging much behind. Of course, if the standby is not able to keep pace with the master in a realistic manner then we have a problem with the approach.
Thanks,
Pavan
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On 2013-04-12 16:58:44 +0530, Pavan Deolasee wrote: > On Fri, Apr 12, 2013 at 4:29 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > > > I don't think that holds true at all. If you look at pg_stat_bgwriter in > > any remotely busy cluster with a hot data set over shared_buffers you'll > > notice that a large percentage of writes will have been done by backends > > themselves. > > Even if what you are saying is true, which I am sure it is, the pages that the > backend is evicting must not have been recently used according to the LRU algorithm, > which means that the WAL pertaining to the last change to the page has in all > likelihood already been replicated, unless the replication is lagging > much behind. Of course, if the standby is not able to keep pace with the > master in a realistic manner then we have a problem with the approach. It frequently takes only sub-second to few-second timeframes for usage counts to drop to zero. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Apr 12, 2013 at 5:53 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote: > On 04/11/2013 07:29 PM, Fujii Masao wrote: >> >> On Thu, Apr 11, 2013 at 10:25 PM, Hannu Krosing <hannu@2ndquadrant.com> >> wrote: >>> >>> You just shut down the old master and let the standby catch >>> up (takes a few microseconds ;) ) before you promote it. >>> >>> After this you can start up the former master with recovery.conf >>> and it will follow nicely. >> >> No. When you shut down the old master, it might not have been >> able to send all the WAL records to the standby. > > In what cases (other than a standby lagging too much or > not listening at all) have you observed this? > >> I have observed >> this situation several times. So in your approach, the new standby >> might fail to catch up with the master nicely. > > The page http://wiki.postgresql.org/wiki/Streaming_Replication claims this: > > * Graceful shutdown > > When smart/fast shutdown is requested, the primary waits to exit > until XLOG records have been sent to the standby, up to the > shutdown checkpoint record. > > Maybe you were requesting immediate shutdown? No. I did fast shutdown. It's true that the master waits for the checkpoint record to be replicated to the standby when fast shutdown is performed. But the standby cannot always successfully receive all the WAL records which the master sent. To ensure that all WAL records have been replicated to the standby at fast shutdown, we should make the walsender wait for the standby to write the checkpoint record and send back the ACK. Regards, -- Fujii Masao
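On the walsender side, the wait Fujii describes could amount to something like the loop below, run at fast shutdown after the shutdown checkpoint record has been sent. MyWalSnd->flush and ProcessRepliesIfAny() exist in walsender.c of that era; the surrounding function is only a sketch of the idea, not actual code:

    /*
     * Sketch: do not close the replication connection until the standby
     * has reported flushing past the shutdown checkpoint record.
     */
    static void
    WalSndWaitForShutdownAck(XLogRecPtr ckptRecEnd)
    {
        for (;;)
        {
            XLogRecPtr  standbyFlush;

            /* Consume any standby status (reply) messages. */
            ProcessRepliesIfAny();

            SpinLockAcquire(&MyWalSnd->mutex);
            standbyFlush = MyWalSnd->flush;
            SpinLockRelease(&MyWalSnd->mutex);

            if (standbyFlush >= ckptRecEnd)
                break;          /* standby has everything durably on disk */

            pg_usleep(10000L);  /* or wait on the walsender latch */
        }
    }

Only after this returns would the walsender close the socket, so the standby could never see EOF while WAL it has not yet flushed is still in flight.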
On Fri, Apr 12, 2013 at 7:57 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-04-12 02:29:01 +0900, Fujii Masao wrote: >> On Thu, Apr 11, 2013 at 10:25 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote: >> > >> > You just shut down the old master and let the standby catch >> > up (takes a few microseconds ;) ) before you promote it. >> > >> > After this you can start up the former master with recovery.conf >> > and it will follow nicely. >> >> No. When you shut down the old master, it might not have been >> able to send all the WAL records to the standby. I have observed >> this situation several times. So in your approach, the new standby >> might fail to catch up with the master nicely. > > It seems most of this thread is focusing on the wrong thing then. If we > really are only talking about planned failover then we need to solve > *that*, not some ominous "don't flush data too early" which has > noticeable performance and implementation complexity problems. At least I'd like to talk about not only planned failover but also normal failover. > I guess you're observing that not everything is replicated because you're > doing an immediate shutdown No. I did fast shutdown. At fast shutdown, after walsender sends the checkpoint record and closes the replication connection, walreceiver can detect the close of connection before receiving all WAL records. This means that, even if walsender sends all WAL records, walreceiver cannot always receive all of them. > You could even teach the standby not to increment the timeline in that > case since that's safe. I don't think this is required, thanks to Heikki's recent great efforts on timelines. Regards, -- Fujii Masao
On 04/14/2013 05:56 PM, Fujii Masao wrote: > On Fri, Apr 12, 2013 at 7:57 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> On 2013-04-12 02:29:01 +0900, Fujii Masao wrote: >>> On Thu, Apr 11, 2013 at 10:25 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote: >>>> You just shut down the old master and let the standby catch >>>> up (takes a few microseconds ;) ) before you promote it. >>>> >>>> After this you can start up the former master with recovery.conf >>>> and it will follow nicely. >>> No. When you shut down the old master, it might not have been >>> able to send all the WAL records to the standby. I have observed >>> this situation several times. So in your approach, the new standby >>> might fail to catch up with the master nicely. >> It seems most of this thread is focusing on the wrong thing then. If we >> really are only talking about planned failover then we need to solve >> *that*, not some ominous "don't flush data too early" which has >> noticeable performance and implementation complexity problems. > At least I'd like to talk about not only planned failover but also normal > failover. > >> I guess you're observing that not everything is replicated because you're >> doing an immediate shutdown > No. I did fast shutdown. > > At fast shutdown, after walsender sends the checkpoint record and > closes the replication connection, walreceiver can detect the close > of connection before receiving all WAL records. This means that, > even if walsender sends all WAL records, walreceiver cannot always > receive all of them. Seems very much like a bug, or at least a missing mode - synchronous shutdown - where the master would wait for an ack from the standby(s) before closing the client connection. > >> You could even teach the standby not to increment the timeline in that >> case since that's safe. > I don't think this is required, thanks to Heikki's recent great efforts on > timelines. > > Regards, >
On Apr14, 2013, at 17:56 , Fujii Masao <masao.fujii@gmail.com> wrote: > At fast shutdown, after walsender sends the checkpoint record and > closes the replication connection, walreceiver can detect the close > of connection before receiving all WAL records. This means that, > even if walsender sends all WAL records, walreceiver cannot always > receive all of them.

That sounds like a bug in walreceiver to me.

The following code in walreceiver's main loop looks suspicious:

    /*
     * Process the received data, and any subsequent data we
     * can read without blocking.
     */
    for (;;)
    {
        if (len > 0)
        {
            /* Something was received from master, so reset timeout */
            ...
            XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
        }
        else if (len == 0)
            break;
        else if (len < 0)
        {
            ereport(LOG,
                    (errmsg("replication terminated by primary server"),
                     errdetail("End of WAL reached on timeline %u at %X/%X",
                               startpointTLI,
                               (uint32) (LogstreamResult.Write >> 32),
                               (uint32) LogstreamResult.Write)));
            ...
        }
        len = walrcv_receive(0, &buf);
    }

    /* Let the master know that we received some data. */
    XLogWalRcvSendReply(false, false);

    /*
     * If we've written some records, flush them to disk and
     * let the startup process and primary server know about
     * them.
     */
    XLogWalRcvFlush(false);

The loop at the top looks fine - it specifically avoids throwing an error on EOF. But the code then proceeds to XLogWalRcvSendReply() which doesn't seem to have the same smarts - it simply does

    if (PQputCopyData(streamConn, buffer, nbytes) <= 0 ||
        PQflush(streamConn))
        ereport(ERROR,
                (errmsg("could not send data to WAL stream: %s",
                        PQerrorMessage(streamConn))));

Unless I'm missing something, that certainly seems to explain how a standby can lag behind even after a controlled shutdown of the master.

best regards, Florian Pflug
On Monday, April 15, 2013 1:02 PM Florian Pflug wrote: > On Apr14, 2013, at 17:56 , Fujii Masao <masao.fujii@gmail.com> wrote: > > At fast shutdown, after walsender sends the checkpoint record and > > closes the replication connection, walreceiver can detect the close > > of connection before receiving all WAL records. This means that, > > even if walsender sends all WAL records, walreceiver cannot always > > receive all of them. > > That sounds like a bug in walreceiver to me. > > The following code in walreceiver's main loop looks suspicious: > > /* > * Process the received data, and any subsequent data we > * can read without blocking. > */ > for (;;) > { > if (len > 0) > { > /* Something was received from master, so reset timeout */ > ... > XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1); > } > else if (len == 0) > break; > else if (len < 0) > { > ereport(LOG, > (errmsg("replication terminated by primary server"), > errdetail("End of WAL reached on timeline %u at %X/%X", > startpointTLI, > (uint32) (LogstreamResult.Write >> 32), > (uint32) LogstreamResult.Write))); > ... > } > len = walrcv_receive(0, &buf); > } > > /* Let the master know that we received some data. */ > XLogWalRcvSendReply(false, false); > > /* > * If we've written some records, flush them to disk and > * let the startup process and primary server know about > * them. > */ > XLogWalRcvFlush(false); > > The loop at the top looks fine - it specifically avoids throwing > an error on EOF. But the code then proceeds to XLogWalRcvSendReply() > which doesn't seem to have the same smarts - it simply does > > if (PQputCopyData(streamConn, buffer, nbytes) <= 0 || > PQflush(streamConn)) > ereport(ERROR, > (errmsg("could not send data to WAL stream: %s", > PQerrorMessage(streamConn)))); > > Unless I'm missing something, that certainly seems to explain > how a standby can lag behind even after a controlled shutdown of > the master. Do you mean to say that because an error has occurred, it would not be able to flush received WAL, which could result in loss of WAL? I think even if an error occurs, it will call flush in WalRcvDie(), before terminating WALReceiver. With Regards, Amit Kapila.
On Apr17, 2013, at 12:22 , Amit Kapila <amit.kapila@huawei.com> wrote: > Do you mean to say that because an error has occurred, it would not be able to > flush received WAL, which could result in loss of WAL? > I think even if an error occurs, it will call flush in WalRcvDie(), before > terminating WALReceiver. Hm, true, but for that to prevent the problem the inner processing loop needs to always read up to EOF before it exits and we attempt to send a reply. Which I don't think it necessarily does. Assume that the master sends a chunk of data, waits a bit, and finally sends the shutdown record and exits. The slave might then receive the first chunk, and it might trigger sending a reply. At the time the reply is sent, the master has already sent the shutdown record and closed the connection, and we'll thus fail to reply and abort. Since the shutdown record has never been read from the socket, XLogWalRcvFlush won't flush it, and the slave ends up behind the master. Also, since XLogWalRcvProcessMsg responds to keep-alive messages, we might also error out of the inner processing loop if the server closes the socket after sending a keepalive but before we attempt to respond. Fixing this on the receive side alone seems quite messy and fragile. So instead, I think we should let the master send a shutdown message after it has sent everything it wants to send, and wait for the client to acknowledge it before shutting down the socket. If the client fails to respond, we could log a fat WARNING. best regards, Florian Pflug
On Wednesday, April 17, 2013 4:19 PM Florian Pflug wrote: > On Apr17, 2013, at 12:22 , Amit Kapila <amit.kapila@huawei.com> wrote: > > Do you mean to say that because an error has occurred, it would not be able to > > flush received WAL, which could result in loss of WAL? > > I think even if an error occurs, it will call flush in WalRcvDie(), before > > terminating WALReceiver. > > Hm, true, but for that to prevent the problem the inner processing > loop needs to always read up to EOF before it exits and we attempt > to send a reply. Which I don't think it necessarily does. Assume > that the master sends a chunk of data, waits a bit, and finally > sends the shutdown record and exits. The slave might then receive > the first chunk, and it might trigger sending a reply. At the time > the reply is sent, the master has already sent the shutdown record > and closed the connection, and we'll thus fail to reply and abort. > Since the shutdown record has never been read from the socket, > XLogWalRcvFlush won't flush it, and the slave ends up behind the > master. > > Also, since XLogWalRcvProcessMsg responds to keep-alive messages, > we might also error out of the inner processing loop if the server > closes the socket after sending a keepalive but before we attempt > to respond. > > Fixing this on the receive side alone seems quite messy and fragile. > So instead, I think we should let the master send a shutdown message > after it has sent everything it wants to send, and wait for the client > to acknowledge it before shutting down the socket. > > If the client fails to respond, we could log a fat WARNING. Your explanation seems okay, but I think it would be better to reproduce the actual problem before discussing the exact solution. With Regards, Amit Kapila.
On Wed, Apr 17, 2013 at 12:49:10PM +0200, Florian Pflug wrote: > Fixing this on the receive side alone seems quite messy and fragile. > So instead, I think we should let the master send a shutdown message > after it has sent everything it wants to send, and wait for the client > to acknowledge it before shutting down the socket. > > If the client fails to respond, we could log a fat WARNING. ISTM the master should half close the socket, using shutdown(). That way the client receives an EOF and can still then send its reply to the master. Then when the master receives that it can close() completely. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
On Apr19, 2013, at 14:46 , Martijn van Oosterhout <kleptog@svana.org> wrote: > On Wed, Apr 17, 2013 at 12:49:10PM +0200, Florian Pflug wrote: >> Fixing this on the receive side alone seems quite messy and fragile. >> So instead, I think we should let the master send a shutdown message >> after it has sent everything it wants to send, and wait for the client >> to acknowledge it before shutting down the socket. >> >> If the client fails to respond, we could log a fat WARNING. > > ISTM the master should half close the socket, using shutdown(). That > way the client receives an EOF and can still then send its reply to the > master. Then when the master receives that it can close() completely. Hm, there may be arbitrarily many reply requests within the unread data in the socket's buffer, so waiting for just one reply won't work. Also, to distinguish a slave which crashes while the master shuts down from one that has received all WAL and flushed it, the slave should flush all WAL and send a final reply before closing the socket. So the master would, upon shutting down, close only its writing end of the connection, and continue to receive replies until it sees EOF. After all slaves have gone, the master would emit a WARNING for every slave whose last logged flush position is earlier than the master's idea of end-of-WAL. The slave would, upon seeing EOF, flush all its WAL, send a final reply, and close() the socket. I'm not sure that relying on TCP's half-close feature has much benefit over using a home-grown shutdown message, though. Anyway, the basic shutdown protocol would be the same regardless of what exactly we use to signal a shutdown. BTW, I assume we'd only do this for smart shutdowns. best regards, Florian Pflug
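In plain BSD-socket terms, deliberately outside any PostgreSQL internals, the master's side of the half-close handshake Martijn and Florian are discussing looks roughly like this sketch:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /*
     * Sketch of the master-side shutdown sequence: stop sending, let the
     * standby see EOF, then keep reading its replies (each carrying a
     * flush position) until the standby closes its end.
     */
    static void
    master_half_close(int sock)
    {
        char    buf[8192];
        ssize_t n;

        shutdown(sock, SHUT_WR);    /* standby's next read returns EOF */

        while ((n = read(sock, buf, sizeof(buf))) > 0)
        {
            /* parse reply messages; remember the last reported flush LSN */
        }

        close(sock);                /* standby has flushed and closed */
    }

The slave-side mirror image would be: on EOF, flush all received WAL, send one final reply, then close. The master could then WARN about any standby whose last reported flush position falls short of its own end of WAL, exactly as Florian describes above.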
On Wed, Apr 17, 2013 at 10:11 PM, Amit Kapila <amit.kapila@huawei.com> wrote: > On Wednesday, April 17, 2013 4:19 PM Florian Pflug wrote: >> On Apr17, 2013, at 12:22 , Amit Kapila <amit.kapila@huawei.com> wrote: >> > Do you mean to say that because an error has occurred, it would not be able to >> > flush received WAL, which could result in loss of WAL? >> > I think even if an error occurs, it will call flush in WalRcvDie(), before >> > terminating WALReceiver. >> >> Hm, true, but for that to prevent the problem the inner processing >> loop needs to always read up to EOF before it exits and we attempt >> to send a reply. Which I don't think it necessarily does. Assume >> that the master sends a chunk of data, waits a bit, and finally >> sends the shutdown record and exits. The slave might then receive >> the first chunk, and it might trigger sending a reply. At the time >> the reply is sent, the master has already sent the shutdown record >> and closed the connection, and we'll thus fail to reply and abort. >> Since the shutdown record has never been read from the socket, >> XLogWalRcvFlush won't flush it, and the slave ends up behind the >> master. >> >> Also, since XLogWalRcvProcessMsg responds to keep-alive messages, >> we might also error out of the inner processing loop if the server >> closes the socket after sending a keepalive but before we attempt >> to respond. >> >> Fixing this on the receive side alone seems quite messy and fragile. >> So instead, I think we should let the master send a shutdown message >> after it has sent everything it wants to send, and wait for the client >> to acknowledge it before shutting down the socket. >> >> If the client fails to respond, we could log a fat WARNING. > > Your explanation seems okay, but I think it would be better to reproduce > the actual problem before discussing the exact solution. I got this problem several times when I enabled WAL archiving and shut down the master. Regards, -- Fujii Masao
On Wed, Apr 17, 2013 at 7:49 PM, Florian Pflug <fgp@phlo.org> wrote: > On Apr17, 2013, at 12:22 , Amit Kapila <amit.kapila@huawei.com> wrote: >> Do you mean to say that because an error has occurred, it would not be able to >> flush received WAL, which could result in loss of WAL? >> I think even if an error occurs, it will call flush in WalRcvDie(), before >> terminating WALReceiver. > > Hm, true, but for that to prevent the problem the inner processing > loop needs to always read up to EOF before it exits and we attempt > to send a reply. Which I don't think it necessarily does. Assume > that the master sends a chunk of data, waits a bit, and finally > sends the shutdown record and exits. The slave might then receive > the first chunk, and it might trigger sending a reply. At the time > the reply is sent, the master has already sent the shutdown record > and closed the connection, and we'll thus fail to reply and abort. > Since the shutdown record has never been read from the socket, > XLogWalRcvFlush won't flush it, and the slave ends up behind the > master. > > Also, since XLogWalRcvProcessMsg responds to keep-alive messages, > we might also error out of the inner processing loop if the server > closes the socket after sending a keepalive but before we attempt > to respond. > > Fixing this on the receive side alone seems quite messy and fragile. > So instead, I think we should let the master send a shutdown message > after it has sent everything it wants to send, and wait for the client > to acknowledge it before shutting down the socket. Agreed. I've tried to fix this problem on only the walreceiver side, but that failed. I agree that we should change walsender so that it waits for the reply from the standby before closing the connection. Regards, -- Fujii Masao