needless complexity in StartupXLOG - Mailing list pgsql-hackers

From Robert Haas
Subject needless complexity in StartupXLOG
Date
Msg-id CA+TgmoYmw==TOJ6EzYb_vcjyS09NkzrVKSyBKUUyo1zBEaJASA@mail.gmail.com
Whole thread Raw
Responses Re: needless complexity in StartupXLOG  (Stephen Frost <sfrost@snowman.net>)
Re: needless complexity in StartupXLOG  (Andres Freund <andres@anarazel.de>)
Re: needless complexity in StartupXLOG  (Amul Sul <sulamul@gmail.com>)
List pgsql-hackers
StartupXLOG() has code beginning around line 7900 of xlog.c that
decides, at the end of recovery, between four possible courses of
action. It either writes an end-of-recovery record, or requests a
checkpoint, or creates a checkpoint, or does nothing, depending on the
value of 3 flag variables, and also on whether we're still able to
read the last checkpoint record:

                checkPointLoc = ControlFile->checkPoint;

                /*
                 * Confirm the last checkpoint is available for us to recover
                 * from if we fail.
                 */
                record = ReadCheckpointRecord(xlogreader,
checkPointLoc, 1, false);
                if (record != NULL)
                {
                    promoted = true;

It seems to me that ReadCheckpointRecord() should never fail here. It
should always be true, even when we're not in recovery, that the last
checkpoint record is readable. If there's ever a situation where that
is not true, even for an instant, then a crash at that point will be
unrecoverable. Now, one way that it could happen is if we've got a bug
in the code someplace that removes WAL segments too soon. However, if
we have such a bug, we should fix it. What this code does is says "oh,
I guess we removed the old checkpoint too soon, no big deal, let's
just be more aggressive about getting the next one done," which I do
not think is the right response. Aside from a bug, the only other way
I can see it happening is if someone is manually removing WAL segments
as the server is running through recovery, perhaps as some last-ditch
play to avoid running out of disk space. I don't think the server
needs to have - or should have - code to cater to such crazy
scenarios. Therefore I think that this check, at least in its current
form, is not sensible.

My first thought was that we should do the check unconditionally,
rather than just when bgwriterLaunched && LocalPromoteIsTriggered, and
ERROR if it fails. But then I wondered what the point of that would be
exactly. If we have such a bug -- and to the best of my knowledge
there's no evidence that we do -- there's no particular reason it
should only happen at the end of recovery. It could happen any time
the system -- or the user, or malicious space aliens -- remove files
from pg_wal, and we have no real idea about the timing of malicious
space alien activity, so doing the test here rather than anywhere else
just seems like a shot in the dark. Perhaps the most logical place
would be to move it to CreateCheckPoint() just after we remove old
xlog files, but we don't have the xlogreader there, and anyway I don't
see how it's really going to help. What bug do we expect to catch by
removing files we think we don't need and then checking that we didn't
remove the files we think we do need? That seems more like grasping at
straws than a serious attempt to make things work better.

So at the moment I am leaning toward the view that we should just
remove this check entirely, as in the attached, proposed patch.

Really, I think we should consider going further. If it's safe to
write an end-of-recovery record rather than a checkpoint, why not do
so all the time? Right now we fail to do that in the above-described
"impossible" scenario where the previous checkpoint record can't be
read, or if we're exiting archive recovery for some reason other than
a promotion request, or if we're in single-user mode, or if we're in
crash recovery. Presumably, people would like to start up the server
quickly in all of those scenarios, so the only reason not to use this
technology all the time is if we think it's safe in some scenarios and
not others. I can't think of a reason why it matters why we're exiting
archive recovery, nor can I think of a reason why it matters whether
we're in single user mode. The distinction between crash recovery and
archive recovery does seem to matter, but if anything the crash
recovery scenario seems simpler, because then there's only one
timeline involved.

I realize that conservatism may have played a role in this code ending
up looking the way that it does; someone seems to have thought it
would be better not to rely on a new idea in all cases. From my point
of view, though, it's scary to have so many cases, especially cases
that don't seem like they should ever be reached. I think that
simplifying the logic here and trying to do the same things in as many
cases as we can will lead to better robustness. Imagine if instead of
all the hairy logic we have now we just replaced this whole if
(IsInRecovery) stanza with this:

if (InRecovery)
    CreateEndOfRecoveryRecord();

That would be WAY easier to reason about than the rat's nest we have
here today. Now, I am not sure what it would take to get there, but I
think that is the direction we ought to be heading.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: Inaccurate error message when set fdw batch_size to 0
Next
From: "Bossart, Nathan"
Date:
Subject: Re: .ready and .done files considered harmful