Re: Avoiding shutdown checkpoint at failover - Mailing list pgsql-hackers
From | Fujii Masao |
---|---|
Subject | Re: Avoiding shutdown checkpoint at failover |
Date | |
Msg-id | CAHGQGwH2rOZjMa_-iPCB=X6=5LbLxSf45o5SSR04YDkJccDz8g@mail.gmail.com |
In response to | Re: Avoiding shutdown checkpoint at failover (Simon Riggs <simon@2ndQuadrant.com>) |
Responses | Re: Avoiding shutdown checkpoint at failover |
List | pgsql-hackers |
On Fri, Jan 20, 2012 at 12:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Wed, Jan 18, 2012 at 7:15 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Sun, Nov 13, 2011 at 5:13 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> On Tue, Nov 1, 2011 at 12:11 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>
>>>> When I say skip the shutdown checkpoint, I mean remove it from the
>>>> critical path of required actions at the end of recovery. We can still
>>>> have a normal checkpoint kicked off at that time, but that no longer
>>>> needs to be on the critical path.
>>>>
>>>> Any problems foreseen? If not, looks like a quick patch.
>>>
>>> Patch attached for discussion/review.
>>
>> This feature is what I want, and very helpful to shorten the failover time in
>> streaming replication.
>>
>> Here are the review comments. Though I've not checked enough whether
>> this feature works fine in all recovery patterns yet.
>>
>> LocalSetXLogInsertAllowed() must be called before LogEndOfRecovery().
>> LocalXLogInsertAllowed must be set to -1 after LogEndOfRecovery().
>>
>> XLOG_END_OF_RECOVERY record is written to the WAL file with new
>> assigned timeline ID. But it must be written to the WAL file with old one.
>> Otherwise, when re-entering a recovery after failover, we cannot find
>> XLOG_END_OF_RECOVERY record at all.
>>
>> Before XLOG_END_OF_RECOVERY record is written,
>> RmgrTable[rmid].rm_cleanup() might write WAL records. They also
>> should be written to the WAL file with old timeline ID.
>>
>> When recovery target is specified, we cannot write new WAL to the file
>> with old timeline because that means that valid WAL records in it are
>> overwritten with new WAL. So when recovery target is specified,
>> ISTM that we cannot skip end of recovery checkpoint. Or we might need
>> to save all information about timelines in the database cluster instead
>> of writing XLOG_END_OF_RECOVERY record, and use it when re-entering
>> a recovery.
>>
>> LogEndOfRecovery() seems to need to call XLogFlush(). Otherwise,
>> what if the server crashes after new timeline history file is created and
>> recovery.conf is removed, but before XLOG_END_OF_RECOVERY record
>> has been flushed to the disk?
>>
>> During recovery, when we replay XLOG_END_OF_RECOVERY record, we
>> should close the currently-opened WAL file and read the WAL file with
>> the timeline which XLOG_END_OF_RECOVERY record indicates.
>> Otherwise, when re-entering a recovery with old timeline, we cannot
>> reach new timeline.
>
> OK, some bad things there, thanks for the insightful comments.
>
> I think you're right that we can't skip the checkpoint if xlog_cleanup
> writes WAL records, since that implies at least one and maybe more
> blocks have changed and need to be flushed. That can be improved upon,
> but not now in 9.2. Cleanup WAL is written in either the old or the new
> timeline, depending upon whether we increment it. So we don't need to
> change anything there, IMHO.
>
> The big problem is how we handle crash recovery after we startup
> without a checkpoint. No quick fixes there.
>
> So let me rethink this: The idea was that we can skip the checkpoint
> if we promote to normal running during streaming replication.
>
> WALReceiver has been writing to WAL files, so can write more data
> without all of the problems noted. Rather than write the
> XLOG_END_OF_RECOVERY record via XLogInsert we should write that **from
> the WALreceiver** as a dummy record by direct injection into the WAL
> stream. So the Startup process sees a WAL record that looks like it
> was written by the primary saying "promote yourself", although it was
> actually written locally by WALreceiver when requested to shutdown.
> That doesn't damage anything because we know we've received all the
> WAL there is. Most importantly we don't need to change any of the
> logic in a way that endangers the other code paths at end of recovery.
>
> Writing the record in that way means we would need to calculate the
> new tli slightly earlier, so we can input the correct value into the
> record. That also solves the problem of how to get additional standbys
> to follow the new master. The XLOG_END_OF_RECOVERY record is simply
> the contents of the newly written tli history file.
>
> If we skip the checkpoint and then crash before the next checkpoint we
> just change timeline when we see XLOG_END_OF_RECOVERY. When we replay
> the XLOG_END_OF_RECOVERY we copy the contents to the appropriate tli
> file and then switch to it.
>
> So this solves 2 problems: having other standbys follow us when they
> don't have archiving, and avoids the checkpoint.
>
> Let me know what you think.

Looks good to me.

One thing I would like to ask is why you think walreceiver is more
appropriate for writing XLOG_END_OF_RECOVERY record than startup process.
I was thinking the opposite, because if we do so, we might be able to
skip the end-of-recovery checkpoint even in file-based log-shipping case.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
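
For readers following the timeline argument in this thread, here is a rough toy model in C of the replay behaviour being discussed. It is only a sketch under stated assumptions and is not PostgreSQL code: `ToyRecord`, `RecKind`, and `toy_replay()` are hypothetical names invented for illustration. Each record is tagged with the timeline of the WAL file that holds it, replay only reads records on its current timeline, and replaying an end-of-recovery record switches to the timeline that record names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record kinds for this toy model; not PostgreSQL's definitions. */
typedef enum { REC_DATA, REC_END_OF_RECOVERY } RecKind;

typedef struct {
    RecKind  kind;
    uint32_t tli;      /* timeline of the WAL file holding this record */
    uint32_t new_tli;  /* only meaningful for REC_END_OF_RECOVERY */
} ToyRecord;

/*
 * Replay records in order, starting on start_tli. A record whose timeline
 * differs from the current one is skipped, as if it lived in a WAL file
 * that replay never opens. Returns the timeline reached at the end and
 * stores the number of records actually applied in *applied.
 */
uint32_t toy_replay(const ToyRecord *recs, size_t n,
                    uint32_t start_tli, size_t *applied)
{
    assert(recs != NULL && applied != NULL);

    uint32_t cur = start_tli;
    *applied = 0;

    for (size_t i = 0; i < n; i++) {
        if (recs[i].tli != cur)
            continue;               /* not visible from the current timeline */
        (*applied)++;
        if (recs[i].kind == REC_END_OF_RECOVERY)
            cur = recs[i].new_tli;  /* switch to the timeline the record names */
    }
    return cur;
}
```

In this model, an end-of-recovery record stored on the old timeline lets replay that starts on the old timeline reach the new one. If the same record is stored on the new timeline instead, replay never sees it and stays on the old timeline, which is the failure mode raised in the review comments quoted above.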