Re: Avoiding shutdown checkpoint at failover - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Avoiding shutdown checkpoint at failover
Date
Msg-id CA+U5nMLJ8ENK-ZaQdukNULT6C_3k1vGFDpfw5fmgeEk48tjRQA@mail.gmail.com
In response to Re: Avoiding shutdown checkpoint at failover  (Fujii Masao <masao.fujii@gmail.com>)
Responses Re: Avoiding shutdown checkpoint at failover
List pgsql-hackers
On Wed, Jan 18, 2012 at 7:15 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Sun, Nov 13, 2011 at 5:13 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On Tue, Nov 1, 2011 at 12:11 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>>> When I say skip the shutdown checkpoint, I mean remove it from the
>>> critical path of required actions at the end of recovery. We can still
>>> have a normal checkpoint kicked off at that time, but that no longer
>>> needs to be on the critical path.
>>>
>>> Any problems foreseen? If not, looks like a quick patch.
>>
>> Patch attached for discussion/review.
>
> This feature is what I want, and very helpful to shorten the failover time in
> streaming replication.
>
> Here are the review comments. Though I've not checked enough whether
> this feature works fine in all recovery patterns yet.
>
> LocalSetXLogInsertAllowed() must be called before LogEndOfRecovery().
> LocalXLogInsertAllowed must be set to -1 after LogEndOfRecovery().
>
> XLOG_END_OF_RECOVERY record is written to the WAL file with new
> assigned timeline ID. But it must be written to the WAL file with old one.
> Otherwise, when re-entering a recovery after failover, we cannot find
> XLOG_END_OF_RECOVERY record at all.
>
> Before XLOG_END_OF_RECOVERY record is written,
> RmgrTable[rmid].rm_cleanup() might write WAL records. They also
> should be written to the WAL file with old timeline ID.
>
> When a recovery target is specified, we cannot write new WAL to the file
> with the old timeline, because that would overwrite valid WAL records in it
> with new WAL. So when a recovery target is specified,
> ISTM that we cannot skip end of recovery checkpoint. Or we might need
> to save all information about timelines in the database cluster instead
> of writing XLOG_END_OF_RECOVERY record, and use it when re-entering
> a recovery.
>
> LogEndOfRecovery() seems to need to call XLogFlush(). Otherwise,
> what if the server crashes after the new timeline history file is created and
> recovery.conf is removed, but before the XLOG_END_OF_RECOVERY record
> has been flushed to disk?
>
> During recovery, when we replay XLOG_END_OF_RECOVERY record, we
> should close the currently-opened WAL file and read the WAL file with
> the timeline which XLOG_END_OF_RECOVERY record indicates.
> Otherwise, when re-entering a recovery with old timeline, we cannot
> reach new timeline.



OK, some bad things there, thanks for the insightful comments.



I think you're right that we can't skip the checkpoint if xlog_cleanup
writes WAL records, since that implies at least one and maybe more
blocks have changed and need to be flushed. That can be improved upon,
but not now in 9.2.

Cleanup WAL is written in either the old or the new
timeline, depending upon whether we increment the timeline ID. So we
don't need to change anything there, IMHO.

The big problem is how we handle crash recovery after we start up
without a checkpoint. No quick fixes there.

So let me rethink this: The idea was that we can skip the checkpoint
if we promote to normal running during streaming replication.

WALreceiver has been writing to WAL files, so it can write more data
without all of the problems noted. Rather than write the
XLOG_END_OF_RECOVERY record via XLogInsert we should write that **from
the WALreceiver** as a dummy record by direct injection into the WAL
stream. So the Startup process sees a WAL record that looks like it
was written by the primary saying "promote yourself", although it was
actually written locally by WALreceiver when requested to shut down.
That doesn't damage anything because we know we've received all the
WAL there is. Most importantly we don't need to change any of the
logic in a way that endangers the other code paths at end of recovery.
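
To make the idea concrete, here is a rough toy model in Python. It is not
PostgreSQL code: WalRecord, WalReceiver, shutdown_for_promotion and
startup_replay are all hypothetical names invented for illustration. It only
shows the ordering: the receiver appends the dummy record on the old
timeline, and the startup loop treats it like any other replayed record.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Toy model only: none of these names are PostgreSQL APIs.
END_OF_RECOVERY = "END_OF_RECOVERY"


@dataclass
class WalRecord:
    kind: str          # "DATA" or END_OF_RECOVERY
    tli: int           # timeline the record was written on
    payload: str = ""


class WalReceiver:
    """Appends records to the local WAL, as it already does while streaming."""

    def __init__(self, wal: List[WalRecord], tli: int):
        self.wal = wal
        self.tli = tli

    def receive(self, payload: str) -> None:
        self.wal.append(WalRecord("DATA", self.tli, payload))

    def shutdown_for_promotion(self, new_tli: int) -> None:
        # Inject the dummy record on the OLD timeline; the payload names
        # the new timeline, so it must be computed before this call.
        self.wal.append(WalRecord(END_OF_RECOVERY, self.tli, str(new_tli)))


def startup_replay(wal: List[WalRecord], start_tli: int) -> Tuple[List[str], int]:
    """Replay records until the end-of-recovery record is seen.

    Returns the applied payloads and the timeline to continue on.
    """
    applied: List[str] = []
    tli = start_tli
    for rec in wal:
        if rec.kind == END_OF_RECOVERY:
            tli = int(rec.payload)   # "promote yourself" onto the new timeline
            break
        applied.append(rec.payload)
    return applied, tli
```

In this model the startup loop has no special promotion path; it simply
reacts to one more record in the stream, which is the property the proposal
relies on.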

Writing the record in that way means we would need to calculate the
new tli slightly earlier, so we can input the correct value into the
record. That also solves the problem of how to get additional standbys
to follow the new master. The XLOG_END_OF_RECOVERY record is simply
the contents of the newly written tli history file.

If we skip the checkpoint and then crash before the next checkpoint we
just change timeline when we see XLOG_END_OF_RECOVERY. When we replay
the XLOG_END_OF_RECOVERY we copy the contents to the appropriate tli
file and then switch to it.
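
A minimal sketch of that replay step, again as a Python toy model rather than
actual PostgreSQL code. The tuple layout of the WAL records, the
crash_recover function and the history text used are assumptions made for
illustration; only the "8 hex digits plus .history" file naming is meant to
mirror the real convention.

```python
import os
import tempfile


def history_file_name(tli: int) -> str:
    # Timeline ID as 8 hex digits, followed by ".history".
    return "%08X.history" % tli


def crash_recover(wal, start_tli, wal_dir):
    """Replay a toy WAL from the last checkpoint onward.

    wal is a list of (kind, tli, payload) tuples. For an END_OF_RECOVERY
    record the payload is (new_tli, history_text), i.e. the record carries
    the contents of the new timeline history file.
    """
    tli = start_tli
    applied = []
    for kind, rec_tli, payload in wal:
        if rec_tli != tli:
            continue  # record is on a timeline we are not following
        if kind == "END_OF_RECOVERY":
            new_tli, history_text = payload
            # Copy the record contents into the matching history file...
            path = os.path.join(wal_dir, history_file_name(new_tli))
            with open(path, "w") as f:
                f.write(history_text)
            # ...and then switch to the new timeline.
            tli = new_tli
        else:
            applied.append(payload)
    return applied, tli
```

After the switch, records still present on the old timeline are ignored and
replay continues with records written on the new one.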

So this solves 2 problems: letting other standbys follow us when they
don't have archiving, and avoiding the checkpoint.

Let me know what you think.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

