Re: using an end-of-recovery record in all cases - Mailing list pgsql-hackers
From | Amul Sul |
---|---|
Subject | Re: using an end-of-recovery record in all cases |
Date | |
Msg-id | CAAJ_b95L6sxG+ZqBeJFAmVLqRaqznSRMftyBgpTNy1sD5mpq1Q@mail.gmail.com Whole thread Raw |
In response to | Re: using an end-of-recovery record in all cases (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: using an end-of-recovery record in all cases
|
List | pgsql-hackers |
On Tue, Apr 19, 2022 at 2:14 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Sat, Jan 15, 2022 at 11:52 PM Julien Rouhaud <rjuju123@gmail.com> wrote: > > The cfbot reports that this version of the patch doesn't apply anymore: > > Here is a new version of the patch which, unlike v1, I think is > something we could seriously consider applying (not before v16, of > course). It now removes CHECKPOINT_END_OF_RECOVERY completely, and I > attach a second patch as well which nukes checkPoint.PrevTimeLineID as > well. > > I mentioned two problems with $SUBJECT in the first email with this > subject line. One was a bug, which Noah has since fixed (thanks, > Noah). The other problem is that LogStandbySnapshot() and a bunch of > its friends expect latestCompletedXid to always be a normal XID, which > is a problem because (1) we currently set nextXid to 3 and (2) at > startup, we compute latestCompletedXid = nextXid - 1. The current code > dodges this kind of accidentally: the checkpoint that happens at > startup is a "shutdown checkpoint" and thus skips logging a standby > snapshot, since a shutdown checkpoint is a sure indicator that there > are no running transactions. With the changes, the checkpoint at > startup happens after we've started allowing write transactions, and > thus a standby snapshot needs to be logged also. In the cases where > the end-of-recovery record was already being used, the problem could > have happened already, except for the fact that those cases involve a > standby promotion, which doesn't happen during initdb. I explored a > few possible ways of solving this problem. > > The first thing I considered was replacing latestCompletedXid with a > name like incompleteXidHorizon. The idea is that, where > latestCompletedXid is the highest XID that is known to have committed > or aborted, incompleteXidHorizon would be the lowest XID such that all > known commits or aborts are for lower XIDs. In effect, > incompleteXidHorizon would be latestCompletedXid + 1. Since > latestCompletedXid is always normal except when it's 2, > incompleteXidHorizon would be normal in all cases. Initially this > seemed fairly promising, but it kind of fell down when I realized that > we copy latestCompletedXid into > ComputeXidHorizonsResult.latest_completed. That seemed to me to make > the consequences of the change a bit more far-reaching than I liked. > Also, it wasn't entirely clear to me that I wouldn't be introducing > any off-by-one errors into various wraparound calculations with this > approach. > > The second thing I considered was skipping LogStandbySnapshot() during > initdb. There are two ways of doing this and neither of them seem that > great to me. Something that does work is to skip LogStandbySnapshot() > when in single user mode, but there is no particular reason why WAL > generated in single user mode couldn't be replayed on a standby, so > this doesn't seem great. It's too big a hammer for what we really > want, which is just to suppress this during initdb. Another way of > approaching it is to skip it in bootstrap mode, but that actually > doesn't work: initdb then fails during the post-bootstrapping step > rather than during bootstrapping. I thought about patching around that > by forcing the code that generates checkpoint records to forcibly > advance an XID of 3 to 4, but that seemed like working around the > problem from the wrong end. > > So ... I decided that the best approach here seems to be to just skip > FirstNormalTransactionId and use FirstNormalTransactionId + 1 for the > first write transaction that the cluster ever processes. That's very > simple and doesn't seem likely to break anything else. On the downside > it seems a bit grotty, but I don't see anything better, and on the > whole, I think with this approach we come out substantially ahead. > 0001 removes 3 times as many lines as it inserts, and 0002 saves a few > more lines of code. > > Now, I still don't really know that there isn't some theoretical > difficulty here that makes this whole approach a non-starter, but I > also can't think of what it might be. If the promotion case, which has > used the end-of-recovery record for many years, is basically safe, > despite the fact that it switches TLIs, then it seems to me that the > crash recovery case, which doesn't have that complication, ought to be > safe too. But I might well be missing something, so if you see a > problem, please speak up! > /* - * If this was a promotion, request an (online) checkpoint now. This isn't - * required for consistency, but the last restartpoint might be far back, - * and in case of a crash, recovering from it might take a longer than is - * appropriate now that we're not in standby mode anymore. + * Request an (online) checkpoint now. This isn't required for consistency, + * but the last restartpoint might be far back, and in case of a crash, + * recovering from it might take a longer than is appropriate now that + * we're not in standby mode anymore. */ - if (promoted) - RequestCheckpoint(CHECKPOINT_FORCE); + RequestCheckpoint(CHECKPOINT_FORCE); } I think RequestCheckpoint() should be called conditionally. What is the need of the checkpoint if we haven't been through the recovery, in other words, starting up from a clean shutdown? Regards, Amul
pgsql-hackers by date: