Re: needless complexity in StartupXLOG - Mailing list pgsql-hackers

From Amul Sul
Subject Re: needless complexity in StartupXLOG
Date
Msg-id CAAJ_b97ayz59x+sLmVuFYOb-Q69LdUDR+nQQcuhki2JB7aL2Pg@mail.gmail.com
Whole thread Raw
In response to needless complexity in StartupXLOG  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: needless complexity in StartupXLOG  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Mon, Jul 26, 2021 at 9:43 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> StartupXLOG() has code beginning around line 7900 of xlog.c that
> decides, at the end of recovery, between four possible courses of
> action. It either writes an end-of-recovery record, or requests a
> checkpoint, or creates a checkpoint, or does nothing, depending on the
> value of 3 flag variables, and also on whether we're still able to
> read the last checkpoint record:
>
>                 checkPointLoc = ControlFile->checkPoint;
>
>                 /*
>                  * Confirm the last checkpoint is available for us to recover
>                  * from if we fail.
>                  */
>                 record = ReadCheckpointRecord(xlogreader,
> checkPointLoc, 1, false);
>                 if (record != NULL)
>                 {
>                     promoted = true;
>
> It seems to me that ReadCheckpointRecord() should never fail here. It
> should always be true, even when we're not in recovery, that the last
> checkpoint record is readable. If there's ever a situation where that
> is not true, even for an instant, then a crash at that point will be
> unrecoverable. Now, one way that it could happen is if we've got a bug
> in the code someplace that removes WAL segments too soon. However, if
> we have such a bug, we should fix it. What this code does is says "oh,
> I guess we removed the old checkpoint too soon, no big deal, let's
> just be more aggressive about getting the next one done," which I do
> not think is the right response. Aside from a bug, the only other way
> I can see it happening is if someone is manually removing WAL segments
> as the server is running through recovery, perhaps as some last-ditch
> play to avoid running out of disk space. I don't think the server
> needs to have - or should have - code to cater to such crazy
> scenarios. Therefore I think that this check, at least in its current
> form, is not sensible.
>
> My first thought was that we should do the check unconditionally,
> rather than just when bgwriterLaunched && LocalPromoteIsTriggered, and
> ERROR if it fails. But then I wondered what the point of that would be
> exactly. If we have such a bug -- and to the best of my knowledge
> there's no evidence that we do -- there's no particular reason it
> should only happen at the end of recovery. It could happen any time
> the system -- or the user, or malicious space aliens -- remove files
> from pg_wal, and we have no real idea about the timing of malicious
> space alien activity, so doing the test here rather than anywhere else
> just seems like a shot in the dark. Perhaps the most logical place
> would be to move it to CreateCheckPoint() just after we remove old
> xlog files, but we don't have the xlogreader there, and anyway I don't
> see how it's really going to help. What bug do we expect to catch by
> removing files we think we don't need and then checking that we didn't
> remove the files we think we do need? That seems more like grasping at
> straws than a serious attempt to make things work better.
>
> So at the moment I am leaning toward the view that we should just
> remove this check entirely, as in the attached, proposed patch.
>

Can we have an elog() fatal error or warning to make sure that the
last checkpoint is still readable? Since the case where the user
(knowingly or unknowingly) or some buggy code has removed the WAL file
containing the last checkpoint could be possible. If it is then we
would have a hard time finding out when we get further unexpected
behavior due to this. Thoughts?

> Really, I think we should consider going further. If it's safe to
> write an end-of-recovery record rather than a checkpoint, why not do
> so all the time? Right now we fail to do that in the above-described
> "impossible" scenario where the previous checkpoint record can't be
> read, or if we're exiting archive recovery for some reason other than
> a promotion request, or if we're in single-user mode, or if we're in
> crash recovery. Presumably, people would like to start up the server
> quickly in all of those scenarios, so the only reason not to use this
> technology all the time is if we think it's safe in some scenarios and
> not others. I can't think of a reason why it matters why we're exiting
> archive recovery, nor can I think of a reason why it matters whether
> we're in single user mode. The distinction between crash recovery and
> archive recovery does seem to matter, but if anything the crash
> recovery scenario seems simpler, because then there's only one
> timeline involved.
>
> I realize that conservatism may have played a role in this code ending
> up looking the way that it does; someone seems to have thought it
> would be better not to rely on a new idea in all cases. From my point
> of view, though, it's scary to have so many cases, especially cases
> that don't seem like they should ever be reached. I think that
> simplifying the logic here and trying to do the same things in as many
> cases as we can will lead to better robustness. Imagine if instead of
> all the hairy logic we have now we just replaced this whole if
> (IsInRecovery) stanza with this:
>
> if (InRecovery)
>     CreateEndOfRecoveryRecord();

+1, and do the checkpoint at the end unconditionally as we are doing
for the promotion.

Regards,
Amul



pgsql-hackers by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: pg_receivewal starting position
Next
From: Masahiko Sawada
Date:
Subject: Re: Skipping logical replication transactions on subscriber side