Re: using an end-of-recovery record in all cases - Mailing list pgsql-hackers

From Amul Sul
Subject Re: using an end-of-recovery record in all cases
Date
Msg-id CAAJ_b95L6sxG+ZqBeJFAmVLqRaqznSRMftyBgpTNy1sD5mpq1Q@mail.gmail.com
Whole thread Raw
In response to Re: using an end-of-recovery record in all cases  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: using an end-of-recovery record in all cases
List pgsql-hackers
On Tue, Apr 19, 2022 at 2:14 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Jan 15, 2022 at 11:52 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> > The cfbot reports that this version of the patch doesn't apply anymore:
>
> Here is a new version of the patch which, unlike v1, I think is
> something we could seriously consider applying (not before v16, of
> course). It now removes CHECKPOINT_END_OF_RECOVERY completely, and I
> attach a second patch as well which nukes checkPoint.PrevTimeLineID as
> well.
>
> I mentioned two problems with $SUBJECT in the first email with this
> subject line. One was a bug, which Noah has since fixed (thanks,
> Noah). The other problem is that LogStandbySnapshot() and a bunch of
> its friends expect latestCompletedXid to always be a normal XID, which
> is a problem because (1) we currently set nextXid to 3 and (2) at
> startup, we compute latestCompletedXid = nextXid - 1. The current code
> dodges this kind of accidentally: the checkpoint that happens at
> startup is a "shutdown checkpoint" and thus skips logging a standby
> snapshot, since a shutdown checkpoint is a sure indicator that there
> are no running transactions. With the changes, the checkpoint at
> startup happens after we've started allowing write transactions, and
> thus a standby snapshot needs to be logged also. In the cases where
> the end-of-recovery record was already being used, the problem could
> have happened already, except for the fact that those cases involve a
> standby promotion, which doesn't happen during initdb. I explored a
> few possible ways of solving this problem.
>
> The first thing I considered was replacing latestCompletedXid with a
> name like incompleteXidHorizon. The idea is that, where
> latestCompletedXid is the highest XID that is known to have committed
> or aborted, incompleteXidHorizon would be the lowest XID such that all
> known commits or aborts are for lower XIDs. In effect,
> incompleteXidHorizon would be latestCompletedXid + 1. Since
> latestCompletedXid is always normal except when it's 2,
> incompleteXidHorizon would be normal in all cases. Initially this
> seemed fairly promising, but it kind of fell down when I realized that
> we copy latestCompletedXid into
> ComputeXidHorizonsResult.latest_completed. That seemed to me to make
> the consequences of the change a bit more far-reaching than I liked.
> Also, it wasn't entirely clear to me that I wouldn't be introducing
> any off-by-one errors into various wraparound calculations with this
> approach.
>
> The second thing I considered was skipping LogStandbySnapshot() during
> initdb. There are two ways of doing this and neither of them seem that
> great to me. Something that does work is to skip LogStandbySnapshot()
> when in single user mode, but there is no particular reason why WAL
> generated in single user mode couldn't be replayed on a standby, so
> this doesn't seem great. It's too big a hammer for what we really
> want, which is just to suppress this during initdb. Another way of
> approaching it is to skip it in bootstrap mode, but that actually
> doesn't work: initdb then fails during the post-bootstrapping step
> rather than during bootstrapping. I thought about patching around that
> by forcing the code that generates checkpoint records to forcibly
> advance an XID of 3 to 4, but that seemed like working around the
> problem from the wrong end.
>
> So ... I decided that the best approach here seems to be to just skip
> FirstNormalTransactionId and use FirstNormalTransactionId + 1 for the
> first write transaction that the cluster ever processes. That's very
> simple and doesn't seem likely to break anything else. On the downside
> it seems a bit grotty, but I don't see anything better, and on the
> whole, I think with this approach we come out substantially ahead.
> 0001 removes 3 times as many lines as it inserts, and 0002 saves a few
> more lines of code.
>
> Now, I still don't really know that there isn't some theoretical
> difficulty here that makes this whole approach a non-starter, but I
> also can't think of what it might be. If the promotion case, which has
> used the end-of-recovery record for many years, is basically safe,
> despite the fact that it switches TLIs, then it seems to me that the
> crash recovery case, which doesn't have that complication, ought to be
> safe too. But I might well be missing something, so if you see a
> problem, please speak up!
>

  /*
- * If this was a promotion, request an (online) checkpoint now. This isn't
- * required for consistency, but the last restartpoint might be far back,
- * and in case of a crash, recovering from it might take a longer than is
- * appropriate now that we're not in standby mode anymore.
+ * Request an (online) checkpoint now. This isn't required for consistency,
+ * but the last restartpoint might be far back, and in case of a crash,
+ * recovering from it might take a longer than is appropriate now that
+ * we're not in standby mode anymore.
  */
- if (promoted)
- RequestCheckpoint(CHECKPOINT_FORCE);
+ RequestCheckpoint(CHECKPOINT_FORCE);
 }

I think RequestCheckpoint() should be called conditionally.  What is the need
of the checkpoint if we haven't been through the recovery, in other words,
starting up from a clean shutdown?

Regards,
Amul



pgsql-hackers by date:

Previous
From: Peter Smith
Date:
Subject: Re: Handle infinite recursion in logical replication setup
Next
From: Amit Kapila
Date:
Subject: Re: Column Filtering in Logical Replication