Re: using an end-of-recovery record in all cases - Mailing list pgsql-hackers

From Robert Haas
Subject Re: using an end-of-recovery record in all cases
Date
Msg-id CA+TgmoY+SJLTjma4Hfn1sA7S6CZAgbihYd=KzO6srd7Ut=XVBQ@mail.gmail.com
Whole thread Raw
In response to Re: using an end-of-recovery record in all cases  (Julien Rouhaud <rjuju123@gmail.com>)
Responses Re: using an end-of-recovery record in all cases  (Amul Sul <sulamul@gmail.com>)
Re: using an end-of-recovery record in all cases  (Nathan Bossart <nathandbossart@gmail.com>)
Re: using an end-of-recovery record in all cases  (Amul Sul <sulamul@gmail.com>)
List pgsql-hackers
On Sat, Jan 15, 2022 at 11:52 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> The cfbot reports that this version of the patch doesn't apply anymore:

Here is a new version of the patch which, unlike v1, I think is
something we could seriously consider applying (not before v16, of
course). It now removes CHECKPOINT_END_OF_RECOVERY completely, and I
attach a second patch as well which nukes checkPoint.PrevTimeLineID as
well.

I mentioned two problems with $SUBJECT in the first email with this
subject line. One was a bug, which Noah has since fixed (thanks,
Noah). The other problem is that LogStandbySnapshot() and a bunch of
its friends expect latestCompletedXid to always be a normal XID, which
is a problem because (1) we currently set nextXid to 3 and (2) at
startup, we compute latestCompletedXid = nextXid - 1. The current code
dodges this kind of accidentally: the checkpoint that happens at
startup is a "shutdown checkpoint" and thus skips logging a standby
snapshot, since a shutdown checkpoint is a sure indicator that there
are no running transactions. With the changes, the checkpoint at
startup happens after we've started allowing write transactions, and
thus a standby snapshot needs to be logged also. In the cases where
the end-of-recovery record was already being used, the problem could
have happened already, except for the fact that those cases involve a
standby promotion, which doesn't happen during initdb. I explored a
few possible ways of solving this problem.

The first thing I considered was replacing latestCompletedXid with a
name like incompleteXidHorizon. The idea is that, where
latestCompletedXid is the highest XID that is known to have committed
or aborted, incompleteXidHorizon would be the lowest XID such that all
known commits or aborts are for lower XIDs. In effect,
incompleteXidHorizon would be latestCompletedXid + 1. Since
latestCompletedXid is always normal except when it's 2,
incompleteXidHorizon would be normal in all cases. Initially this
seemed fairly promising, but it kind of fell down when I realized that
we copy latestCompletedXid into
ComputeXidHorizonsResult.latest_completed. That seemed to me to make
the consequences of the change a bit more far-reaching than I liked.
Also, it wasn't entirely clear to me that I wouldn't be introducing
any off-by-one errors into various wraparound calculations with this
approach.

The second thing I considered was skipping LogStandbySnapshot() during
initdb. There are two ways of doing this and neither of them seem that
great to me. Something that does work is to skip LogStandbySnapshot()
when in single user mode, but there is no particular reason why WAL
generated in single user mode couldn't be replayed on a standby, so
this doesn't seem great. It's too big a hammer for what we really
want, which is just to suppress this during initdb. Another way of
approaching it is to skip it in bootstrap mode, but that actually
doesn't work: initdb then fails during the post-bootstrapping step
rather than during bootstrapping. I thought about patching around that
by forcing the code that generates checkpoint records to forcibly
advance an XID of 3 to 4, but that seemed like working around the
problem from the wrong end.

So ... I decided that the best approach here seems to be to just skip
FirstNormalTransactionId and use FirstNormalTransactionId + 1 for the
first write transaction that the cluster ever processes. That's very
simple and doesn't seem likely to break anything else. On the downside
it seems a bit grotty, but I don't see anything better, and on the
whole, I think with this approach we come out substantially ahead.
0001 removes 3 times as many lines as it inserts, and 0002 saves a few
more lines of code.

Now, I still don't really know that there isn't some theoretical
difficulty here that makes this whole approach a non-starter, but I
also can't think of what it might be. If the promotion case, which has
used the end-of-recovery record for many years, is basically safe,
despite the fact that it switches TLIs, then it seems to me that the
crash recovery case, which doesn't have that complication, ought to be
safe too. But I might well be missing something, so if you see a
problem, please speak up!

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Dump/Restore of non-default PKs
Next
From: Tom Lane
Date:
Subject: Re: Dump/Restore of non-default PKs