Thread: Re: using an end-of-recovery record in all cases

Re: using an end-of-recovery record in all cases

From

Kyotaro Horiguchi

Date:

07 September 2021, 07:26:33

At Fri, 3 Sep 2021 15:56:35 +0530, Amul Sul <sulamul@gmail.com> wrote in 
> On Fri, Sep 3, 2021 at 10:23 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> You might need the following change at the end of StartupXLOG():
> 
> - if (promoted)
> - RequestCheckpoint(CHECKPOINT_FORCE);
> + RequestCheckpoint(CHECKPOINT_FORCE);

At Fri, 3 Sep 2021 10:13:53 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> Did you use the patch I posted, or a different one

Thanks to both of you. That was my stupid.

..But even with the patch (with removal of 018_..pl test file), I
didn't get check-world failed..  I'll retry later.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: using an end-of-recovery record in all cases

From

Amul Sul

Date:

05 October 2021, 11:43:41

I was trying to understand the v1 patch and found that at the end
RequestCheckpoint() is called unconditionally, I think that should
have been called if REDO had performed, here is the snip from the v1
patch:

  /*
- * If this was a promotion, request an (online) checkpoint now. This isn't
- * required for consistency, but the last restartpoint might be far back,
- * and in case of a crash, recovering from it might take a longer than is
- * appropriate now that we're not in standby mode anymore.
+ * Request an (online) checkpoint now. Note that, until this is complete,
+ * a crash would start replay from the same WAL location we did, or from
+ * the last restartpoint that completed. We don't want to let that
+ * situation persist for longer than necessary, since users don't like
+ * long recovery times. On the other hand, they also want to be able to
+ * start doing useful work again as quickly as possible. Therfore, we
+ * don't pass CHECKPOINT_IMMEDIATE to avoid bogging down the system.
+ *
+ * Note that the consequence of requesting a checkpoint here only after
+ * we've allowed WAL writes is that a single checkpoint cycle can span
+ * multiple server lifetimes. So for example if you want to something to
+ * happen at least once per checkpoint cycle or at most once per
+ * checkpoint cycle, you have to consider what happens if the server
+ * is restarted someplace in the middle.
  */
- if (promoted)
- RequestCheckpoint(CHECKPOINT_FORCE);
+ RequestCheckpoint(CHECKPOINT_FORCE);

When I try to call that conditionally like attached, I don't see any
regression failure, correct me if I am missing something here.

Regards,
Amul

Attachment

trial.patch

Re: using an end-of-recovery record in all cases

From

Robert Haas

Date:

05 October 2021, 15:34:11

On Tue, Oct 5, 2021 at 7:44 AM Amul Sul <sulamul@gmail.com> wrote:
> I was trying to understand the v1 patch and found that at the end
> RequestCheckpoint() is called unconditionally, I think that should
> have been called if REDO had performed,

You're right. But I don't think we need an extra variable like this,
right? We can just test InRecovery?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: using an end-of-recovery record in all cases

From

Amul Sul

Date:

05 October 2021, 16:41:10

On Tue, 5 Oct 2021 at 9:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 5, 2021 at 7:44 AM Amul Sul <sulamul@gmail.com> wrote:
> I was trying to understand the v1 patch and found that at the end
> RequestCheckpoint() is called unconditionally, I think that should
> have been called if REDO had performed,

You're right. But I don't think we need an extra variable like this,
right? We can just test InRecovery?

No, InRecovery flag get cleared before this point. I think, we can use lastReplayedEndRecPtr what you have suggested in other thread.

Regards,

Amul

Regards,
Amul Sul
EDB: http://www.enterprisedb.com

Re: using an end-of-recovery record in all cases

From

Robert Haas

Date:

05 October 2021, 17:12:03

On Tue, Oct 5, 2021 at 12:41 PM Amul Sul <sulamul@gmail.com> wrote:
> No, InRecovery flag get cleared before this point. I think, we can use lastReplayedEndRecPtr what you have suggested
inother thread.
 

Hmm, right, that makes sense. Perhaps I should start remembering what
I said in my own emails.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: using an end-of-recovery record in all cases

From

Amul Sul

Date:

06 October 2021, 13:54:23

On Tue, Oct 5, 2021 at 10:42 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Oct 5, 2021 at 12:41 PM Amul Sul <sulamul@gmail.com> wrote:
> > No, InRecovery flag get cleared before this point. I think, we can use lastReplayedEndRecPtr what you have
suggestedin other thread.

>
> Hmm, right, that makes sense. Perhaps I should start remembering what
> I said in my own emails.
>

Here I end up with the attached version where I have dropped the
changes for standby.c and 018_wal_optimize.pl files.  Also, I am not
sure that we should have the changes for bgwriter.c and slot.c in this
patch, but that's not touched.

Regards,
Amul

Attachment

v2-0001-Always-use-an-end-of-recovery-record-rather-than-.patch

Re: using an end-of-recovery record in all cases

From

Amul Sul

Date:

18 October 2021, 05:26:53

On Wed, Oct 6, 2021 at 7:24 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Tue, Oct 5, 2021 at 10:42 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Tue, Oct 5, 2021 at 12:41 PM Amul Sul <sulamul@gmail.com> wrote:
> > > No, InRecovery flag get cleared before this point. I think, we can use lastReplayedEndRecPtr what you have
suggestedin other thread.
 
> >
> > Hmm, right, that makes sense. Perhaps I should start remembering what
> > I said in my own emails.
> >
>
> Here I end up with the attached version where I have dropped the
> changes for standby.c and 018_wal_optimize.pl files.  Also, I am not
> sure that we should have the changes for bgwriter.c and slot.c in this
> patch, but that's not touched.
>

Attached is the rebased and updated version. The patch removes the
newly introduced PerformRecoveryXLogAction() function. In addition to
that, removed the CHECKPOINT_END_OF_RECOVERY flag and its related
code.  Also, dropped changes for bgwriter.c and slot.c in this patch, which
seem unrelated to this work.

Regards,
Amul

Attachment

v3-0001-Always-use-an-end-of-recovery-record-rather-than-.patch

Re: using an end-of-recovery record in all cases

From

Julien Rouhaud

Date:

16 January 2022, 04:52:28

Hi,

On Mon, Oct 18, 2021 at 10:56:53AM +0530, Amul Sul wrote:
> 
> Attached is the rebased and updated version. The patch removes the
> newly introduced PerformRecoveryXLogAction() function. In addition to
> that, removed the CHECKPOINT_END_OF_RECOVERY flag and its related
> code.  Also, dropped changes for bgwriter.c and slot.c in this patch, which
> seem unrelated to this work.

The cfbot reports that this version of the patch doesn't apply anymore:

http://cfbot.cputube.org/patch_36_3365.log
=== Applying patches on top of PostgreSQL commit ID 0c53a6658e47217ad3dd416a5543fc87c3ecfd58 ===
=== applying patch ./v3-0001-Always-use-an-end-of-recovery-record-rather-than-.patch
patching file src/backend/access/transam/xlog.c
[...]
Hunk #14 FAILED at 9061.
Hunk #15 FAILED at 9241.
2 out of 15 hunks FAILED -- saving rejects to file src/backend/access/transam/xlog.c.rej

Can you send a rebased version?  In the meantime I will switch the cf entry to
Waiting on Author.

Re: using an end-of-recovery record in all cases

From

Robert Haas

Date:

18 April 2022, 20:44:03

On Sat, Jan 15, 2022 at 11:52 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> The cfbot reports that this version of the patch doesn't apply anymore:

Here is a new version of the patch which, unlike v1, I think is
something we could seriously consider applying (not before v16, of
course). It now removes CHECKPOINT_END_OF_RECOVERY completely, and I
attach a second patch as well which nukes checkPoint.PrevTimeLineID as
well.

I mentioned two problems with $SUBJECT in the first email with this
subject line. One was a bug, which Noah has since fixed (thanks,
Noah). The other problem is that LogStandbySnapshot() and a bunch of
its friends expect latestCompletedXid to always be a normal XID, which
is a problem because (1) we currently set nextXid to 3 and (2) at
startup, we compute latestCompletedXid = nextXid - 1. The current code
dodges this kind of accidentally: the checkpoint that happens at
startup is a "shutdown checkpoint" and thus skips logging a standby
snapshot, since a shutdown checkpoint is a sure indicator that there
are no running transactions. With the changes, the checkpoint at
startup happens after we've started allowing write transactions, and
thus a standby snapshot needs to be logged also. In the cases where
the end-of-recovery record was already being used, the problem could
have happened already, except for the fact that those cases involve a
standby promotion, which doesn't happen during initdb. I explored a
few possible ways of solving this problem.

The first thing I considered was replacing latestCompletedXid with a
name like incompleteXidHorizon. The idea is that, where
latestCompletedXid is the highest XID that is known to have committed
or aborted, incompleteXidHorizon would be the lowest XID such that all
known commits or aborts are for lower XIDs. In effect,
incompleteXidHorizon would be latestCompletedXid + 1. Since
latestCompletedXid is always normal except when it's 2,
incompleteXidHorizon would be normal in all cases. Initially this
seemed fairly promising, but it kind of fell down when I realized that
we copy latestCompletedXid into
ComputeXidHorizonsResult.latest_completed. That seemed to me to make
the consequences of the change a bit more far-reaching than I liked.
Also, it wasn't entirely clear to me that I wouldn't be introducing
any off-by-one errors into various wraparound calculations with this
approach.

The second thing I considered was skipping LogStandbySnapshot() during
initdb. There are two ways of doing this and neither of them seem that
great to me. Something that does work is to skip LogStandbySnapshot()
when in single user mode, but there is no particular reason why WAL
generated in single user mode couldn't be replayed on a standby, so
this doesn't seem great. It's too big a hammer for what we really
want, which is just to suppress this during initdb. Another way of
approaching it is to skip it in bootstrap mode, but that actually
doesn't work: initdb then fails during the post-bootstrapping step
rather than during bootstrapping. I thought about patching around that
by forcing the code that generates checkpoint records to forcibly
advance an XID of 3 to 4, but that seemed like working around the
problem from the wrong end.

So ... I decided that the best approach here seems to be to just skip
FirstNormalTransactionId and use FirstNormalTransactionId + 1 for the
first write transaction that the cluster ever processes. That's very
simple and doesn't seem likely to break anything else. On the downside
it seems a bit grotty, but I don't see anything better, and on the
whole, I think with this approach we come out substantially ahead.
0001 removes 3 times as many lines as it inserts, and 0002 saves a few
more lines of code.

Now, I still don't really know that there isn't some theoretical
difficulty here that makes this whole approach a non-starter, but I
also can't think of what it might be. If the promotion case, which has
used the end-of-recovery record for many years, is basically safe,
despite the fact that it switches TLIs, then it seems to me that the
crash recovery case, which doesn't have that complication, ought to be
safe too. But I might well be missing something, so if you see a
problem, please speak up!

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: using an end-of-recovery record in all cases

From

Amul Sul

Date:

19 April 2022, 03:56:00

On Tue, Apr 19, 2022 at 2:14 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Jan 15, 2022 at 11:52 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> > The cfbot reports that this version of the patch doesn't apply anymore:
>
> Here is a new version of the patch which, unlike v1, I think is
> something we could seriously consider applying (not before v16, of
> course). It now removes CHECKPOINT_END_OF_RECOVERY completely, and I
> attach a second patch as well which nukes checkPoint.PrevTimeLineID as
> well.
>
> I mentioned two problems with $SUBJECT in the first email with this
> subject line. One was a bug, which Noah has since fixed (thanks,
> Noah). The other problem is that LogStandbySnapshot() and a bunch of
> its friends expect latestCompletedXid to always be a normal XID, which
> is a problem because (1) we currently set nextXid to 3 and (2) at
> startup, we compute latestCompletedXid = nextXid - 1. The current code
> dodges this kind of accidentally: the checkpoint that happens at
> startup is a "shutdown checkpoint" and thus skips logging a standby
> snapshot, since a shutdown checkpoint is a sure indicator that there
> are no running transactions. With the changes, the checkpoint at
> startup happens after we've started allowing write transactions, and
> thus a standby snapshot needs to be logged also. In the cases where
> the end-of-recovery record was already being used, the problem could
> have happened already, except for the fact that those cases involve a
> standby promotion, which doesn't happen during initdb. I explored a
> few possible ways of solving this problem.
>
> The first thing I considered was replacing latestCompletedXid with a
> name like incompleteXidHorizon. The idea is that, where
> latestCompletedXid is the highest XID that is known to have committed
> or aborted, incompleteXidHorizon would be the lowest XID such that all
> known commits or aborts are for lower XIDs. In effect,
> incompleteXidHorizon would be latestCompletedXid + 1. Since
> latestCompletedXid is always normal except when it's 2,
> incompleteXidHorizon would be normal in all cases. Initially this
> seemed fairly promising, but it kind of fell down when I realized that
> we copy latestCompletedXid into
> ComputeXidHorizonsResult.latest_completed. That seemed to me to make
> the consequences of the change a bit more far-reaching than I liked.
> Also, it wasn't entirely clear to me that I wouldn't be introducing
> any off-by-one errors into various wraparound calculations with this
> approach.
>
> The second thing I considered was skipping LogStandbySnapshot() during
> initdb. There are two ways of doing this and neither of them seem that
> great to me. Something that does work is to skip LogStandbySnapshot()
> when in single user mode, but there is no particular reason why WAL
> generated in single user mode couldn't be replayed on a standby, so
> this doesn't seem great. It's too big a hammer for what we really
> want, which is just to suppress this during initdb. Another way of
> approaching it is to skip it in bootstrap mode, but that actually
> doesn't work: initdb then fails during the post-bootstrapping step
> rather than during bootstrapping. I thought about patching around that
> by forcing the code that generates checkpoint records to forcibly
> advance an XID of 3 to 4, but that seemed like working around the
> problem from the wrong end.
>
> So ... I decided that the best approach here seems to be to just skip
> FirstNormalTransactionId and use FirstNormalTransactionId + 1 for the
> first write transaction that the cluster ever processes. That's very
> simple and doesn't seem likely to break anything else. On the downside
> it seems a bit grotty, but I don't see anything better, and on the
> whole, I think with this approach we come out substantially ahead.
> 0001 removes 3 times as many lines as it inserts, and 0002 saves a few
> more lines of code.
>
> Now, I still don't really know that there isn't some theoretical
> difficulty here that makes this whole approach a non-starter, but I
> also can't think of what it might be. If the promotion case, which has
> used the end-of-recovery record for many years, is basically safe,
> despite the fact that it switches TLIs, then it seems to me that the
> crash recovery case, which doesn't have that complication, ought to be
> safe too. But I might well be missing something, so if you see a
> problem, please speak up!
>

  /*
- * If this was a promotion, request an (online) checkpoint now. This isn't
- * required for consistency, but the last restartpoint might be far back,
- * and in case of a crash, recovering from it might take a longer than is
- * appropriate now that we're not in standby mode anymore.
+ * Request an (online) checkpoint now. This isn't required for consistency,
+ * but the last restartpoint might be far back, and in case of a crash,
+ * recovering from it might take a longer than is appropriate now that
+ * we're not in standby mode anymore.
  */
- if (promoted)
- RequestCheckpoint(CHECKPOINT_FORCE);
+ RequestCheckpoint(CHECKPOINT_FORCE);
 }

I think RequestCheckpoint() should be called conditionally.  What is the need
of the checkpoint if we haven't been through the recovery, in other words,
starting up from a clean shutdown?

Regards,
Amul

Re: using an end-of-recovery record in all cases

From

Robert Haas

Date:

19 April 2022, 12:30:26

On Mon, Apr 18, 2022 at 11:56 PM Amul Sul <sulamul@gmail.com> wrote:
> I think RequestCheckpoint() should be called conditionally.  What is the need
> of the checkpoint if we haven't been through the recovery, in other words,
> starting up from a clean shutdown?

Good point. v5 attached.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: using an end-of-recovery record in all cases

From

Nathan Bossart

Date:

19 April 2022, 20:37:59

On Mon, Apr 18, 2022 at 04:44:03PM -0400, Robert Haas wrote:
> Here is a new version of the patch which, unlike v1, I think is
> something we could seriously consider applying (not before v16, of
> course). It now removes CHECKPOINT_END_OF_RECOVERY completely, and I
> attach a second patch as well which nukes checkPoint.PrevTimeLineID as
> well.

I'd like to add a big +1 for this change.  IIUC this should help with some
of the problems I've noted elsewhere [0].

> I mentioned two problems with $SUBJECT in the first email with this
> subject line. One was a bug, which Noah has since fixed (thanks,
> Noah). The other problem is that LogStandbySnapshot() and a bunch of
> its friends expect latestCompletedXid to always be a normal XID, which
> is a problem because (1) we currently set nextXid to 3 and (2) at
> startup, we compute latestCompletedXid = nextXid - 1. The current code
> dodges this kind of accidentally: the checkpoint that happens at
> startup is a "shutdown checkpoint" and thus skips logging a standby
> snapshot, since a shutdown checkpoint is a sure indicator that there
> are no running transactions. With the changes, the checkpoint at
> startup happens after we've started allowing write transactions, and
> thus a standby snapshot needs to be logged also. In the cases where
> the end-of-recovery record was already being used, the problem could
> have happened already, except for the fact that those cases involve a
> standby promotion, which doesn't happen during initdb. I explored a
> few possible ways of solving this problem.

Shouldn't latestCompletedXid be set to MaxTransactionId in this case?  Or
is this related to the logic in FullTransactionIdRetreat() that avoids
skipping over the "actual" special transaction IDs?

> So ... I decided that the best approach here seems to be to just skip
> FirstNormalTransactionId and use FirstNormalTransactionId + 1 for the
> first write transaction that the cluster ever processes. That's very
> simple and doesn't seem likely to break anything else. On the downside
> it seems a bit grotty, but I don't see anything better, and on the
> whole, I think with this approach we come out substantially ahead.
> 0001 removes 3 times as many lines as it inserts, and 0002 saves a few
> more lines of code.

This doesn't seem all that bad to me.  It's a little hacky, but it's very
easy to understand and only happens once per initdb.  I don't think it's
worth any extra complexity.

> Now, I still don't really know that there isn't some theoretical
> difficulty here that makes this whole approach a non-starter, but I
> also can't think of what it might be. If the promotion case, which has
> used the end-of-recovery record for many years, is basically safe,
> despite the fact that it switches TLIs, then it seems to me that the
> crash recovery case, which doesn't have that complication, ought to be
> safe too. But I might well be missing something, so if you see a
> problem, please speak up!

Your reasoning seems sound to me.

[0] https://postgr.es/m/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB%40amazon.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: using an end-of-recovery record in all cases

From

Kyotaro Horiguchi

Date:

20 April 2022, 01:41:07

At Tue, 19 Apr 2022 13:37:59 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in 
> On Mon, Apr 18, 2022 at 04:44:03PM -0400, Robert Haas wrote:
> > Here is a new version of the patch which, unlike v1, I think is
> > something we could seriously consider applying (not before v16, of
> > course). It now removes CHECKPOINT_END_OF_RECOVERY completely, and I
> > attach a second patch as well which nukes checkPoint.PrevTimeLineID as
> > well.
> 
> I'd like to add a big +1 for this change.  IIUC this should help with some
> of the problems I've noted elsewhere [0].

Agreed.

> > I mentioned two problems with $SUBJECT in the first email with this
> > subject line. One was a bug, which Noah has since fixed (thanks,
> > Noah). The other problem is that LogStandbySnapshot() and a bunch of
> > its friends expect latestCompletedXid to always be a normal XID, which
> > is a problem because (1) we currently set nextXid to 3 and (2) at
> > startup, we compute latestCompletedXid = nextXid - 1. The current code
> > dodges this kind of accidentally: the checkpoint that happens at
> > startup is a "shutdown checkpoint" and thus skips logging a standby
> > snapshot, since a shutdown checkpoint is a sure indicator that there
> > are no running transactions. With the changes, the checkpoint at
> > startup happens after we've started allowing write transactions, and
> > thus a standby snapshot needs to be logged also. In the cases where
> > the end-of-recovery record was already being used, the problem could
> > have happened already, except for the fact that those cases involve a
> > standby promotion, which doesn't happen during initdb. I explored a
> > few possible ways of solving this problem.
> 
> Shouldn't latestCompletedXid be set to MaxTransactionId in this case?  Or
> is this related to the logic in FullTransactionIdRetreat() that avoids
> skipping over the "actual" special transaction IDs?

As the result FullTransactionIdRetreat(FirstNormalFullTransactionId)
results in FrozenTransactionId, which looks odd.  It seems to me
rather should be InvalidFullTransactionId, or simply should assert-out
that case.  But incrmenting the very first xid avoid all that
complexity.  It is somewhat hacky but very smiple and understandable.

> > So ... I decided that the best approach here seems to be to just skip
> > FirstNormalTransactionId and use FirstNormalTransactionId + 1 for the
> > first write transaction that the cluster ever processes. That's very
> > simple and doesn't seem likely to break anything else. On the downside
> > it seems a bit grotty, but I don't see anything better, and on the
> > whole, I think with this approach we come out substantially ahead.
> > 0001 removes 3 times as many lines as it inserts, and 0002 saves a few
> > more lines of code.
> 
> This doesn't seem all that bad to me.  It's a little hacky, but it's very
> easy to understand and only happens once per initdb.  I don't think it's
> worth any extra complexity.

+1.

> > Now, I still don't really know that there isn't some theoretical
> > difficulty here that makes this whole approach a non-starter, but I
> > also can't think of what it might be. If the promotion case, which has
> > used the end-of-recovery record for many years, is basically safe,
> > despite the fact that it switches TLIs, then it seems to me that the
> > crash recovery case, which doesn't have that complication, ought to be
> > safe too. But I might well be missing something, so if you see a
> > problem, please speak up!
> 
> Your reasoning seems sound to me.
> 
> [0] https://postgr.es/m/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB%40amazon.com

FWIW, I don't find a flaw in the reasoning, too.

By the way do we need to leave CheckPoint.PrevTimeLineID? It is always
the same value with ThisTimeLineID.  The most dubious part is
ApplyWalRecord but XLOG_CHECKPOINT_SHUTDOWN no longer induces timeline
switch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: using an end-of-recovery record in all cases

From

Amul Sul

Date:

20 April 2022, 06:54:07

 On Tue, Apr 19, 2022 at 2:14 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Jan 15, 2022 at 11:52 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> > The cfbot reports that this version of the patch doesn't apply anymore:
>
> Here is a new version of the patch which, unlike v1, I think is
> something we could seriously consider applying (not before v16, of
> course). It now removes CHECKPOINT_END_OF_RECOVERY completely, and I
> attach a second patch as well which nukes checkPoint.PrevTimeLineID as
> well.
>
> I mentioned two problems with $SUBJECT in the first email with this
> subject line. One was a bug, which Noah has since fixed (thanks,
> Noah). The other problem is that LogStandbySnapshot() and a bunch of
> its friends expect latestCompletedXid to always be a normal XID, which
> is a problem because (1) we currently set nextXid to 3 and (2) at
> startup, we compute latestCompletedXid = nextXid - 1. The current code
> dodges this kind of accidentally: the checkpoint that happens at
> startup is a "shutdown checkpoint" and thus skips logging a standby
> snapshot, since a shutdown checkpoint is a sure indicator that there
> are no running transactions. With the changes, the checkpoint at
> startup happens after we've started allowing write transactions, and
> thus a standby snapshot needs to be logged also. In the cases where
> the end-of-recovery record was already being used, the problem could
> have happened already, except for the fact that those cases involve a
> standby promotion, which doesn't happen during initdb. I explored a
> few possible ways of solving this problem.
>
> The first thing I considered was replacing latestCompletedXid with a
> name like incompleteXidHorizon. The idea is that, where
> latestCompletedXid is the highest XID that is known to have committed
> or aborted, incompleteXidHorizon would be the lowest XID such that all
> known commits or aborts are for lower XIDs. In effect,
> incompleteXidHorizon would be latestCompletedXid + 1. Since
> latestCompletedXid is always normal except when it's 2,
> incompleteXidHorizon would be normal in all cases. Initially this
> seemed fairly promising, but it kind of fell down when I realized that
> we copy latestCompletedXid into
> ComputeXidHorizonsResult.latest_completed. That seemed to me to make
> the consequences of the change a bit more far-reaching than I liked.
> Also, it wasn't entirely clear to me that I wouldn't be introducing
> any off-by-one errors into various wraparound calculations with this
> approach.
>
> The second thing I considered was skipping LogStandbySnapshot() during
> initdb. There are two ways of doing this and neither of them seem that
> great to me. Something that does work is to skip LogStandbySnapshot()
> when in single user mode, but there is no particular reason why WAL
> generated in single user mode couldn't be replayed on a standby, so
> this doesn't seem great. It's too big a hammer for what we really
> want, which is just to suppress this during initdb. Another way of
> approaching it is to skip it in bootstrap mode, but that actually
> doesn't work: initdb then fails during the post-bootstrapping step
> rather than during bootstrapping. I thought about patching around that
> by forcing the code that generates checkpoint records to forcibly
> advance an XID of 3 to 4, but that seemed like working around the
> problem from the wrong end.
>
> So ... I decided that the best approach here seems to be to just skip
> FirstNormalTransactionId and use FirstNormalTransactionId + 1 for the
> first write transaction that the cluster ever processes. That's very
> simple and doesn't seem likely to break anything else. On the downside
> it seems a bit grotty, but I don't see anything better, and on the
> whole, I think with this approach we come out substantially ahead.

IIUC, the failure was something like this on initdb:

running bootstrap script ... TRAP:
FailedAssertion("TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid)",
File: "procarray.c", Line: 2892, PID: 60363)

/bin/postgres(ExceptionalCondition+0xb9)[0xb3917d]
/bin/postgres(GetRunningTransactionData+0x36c)[0x96aa26]
/bin/postgres(LogStandbySnapshot+0x64)[0x974393]
/bin/postgres(CreateCheckPoint+0x67f)[0x5928bf]
/bin/postgres(RequestCheckpoint+0x26)[0x8ca649]
/bin/postgres(StartupXLOG+0xf51)[0x591126]
/bin/postgres(InitPostgres+0x188)[0xb4f2ac]
/bin/postgres(BootstrapModeMain+0x4d3)[0x5ac6de]
/bin/postgres(main+0x275)[0x7ca72e]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f71af82d445]
/bin/postgres[0x48aae9]
child process was terminated by signal 6: Aborted
initdb: removing data directory "/inst/data"

That was happening because RequestCheckpoint() was called from StartupXLOG()
unconditionally, but with the v5 patch that is not true.

If my understanding is correct then we don't need any handling
for latestCompletedXid, at least in this patch.

Regards,
Amul

Re: using an end-of-recovery record in all cases

From

Robert Haas

Date:

20 April 2022, 13:26:07

On Tue, Apr 19, 2022 at 4:38 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Shouldn't latestCompletedXid be set to MaxTransactionId in this case?  Or
> is this related to the logic in FullTransactionIdRetreat() that avoids
> skipping over the "actual" special transaction IDs?

The problem here is this code:

    /* also initialize latestCompletedXid, to nextXid - 1 */
    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
    ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
    FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
    LWLockRelease(ProcArrayLock);

If nextXid is 3, then latestCompletedXid gets 2. But in
GetRunningTransactionData:

    Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid));

> Your reasoning seems sound to me.

I was talking with Thomas Munro yesterday and he thinks there is a
problem with relfilenode reuse here. In normal running, when a
relation is dropped, we leave behind a 0-length file until the next
checkpoint; this keeps that relfilenode from being used even if the
OID counter wraps around. If we didn't do that, then imagine that
while running with wal_level=minimal, we drop an existing relation,
create a new relation with the same OID, load some data into it, and
crash, all within the same checkpoint cycle, then we will be able to
replay the drop, but we will not be able to restore the relation
contents afterward because at wal_level=minimal they are not logged.
Apparently, we don't create tombstone files during recovery because we
know that there will be a checkpoint at the end.

With the existing use of the end-of-recovery record, we always know
that wal_level>minimal, because we're only using it on standbys. But
with this use that wouldn't be true any more. So I guess we need to
start creating tombstone files even during recovery, or else do
something like what Dilip coded up in
http://postgr.es/m/CAFiTN-u=r8UTCSzu6_pnihYAtwR1=esq5sRegTEZ2tLa92fovA@mail.gmail.com
which I think would be a better solution at least in the long term.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: using an end-of-recovery record in all cases

From

Nathan Bossart

Date:

20 April 2022, 17:02:24

On Wed, Apr 20, 2022 at 09:26:07AM -0400, Robert Haas wrote:
> I was talking with Thomas Munro yesterday and he thinks there is a
> problem with relfilenode reuse here. In normal running, when a
> relation is dropped, we leave behind a 0-length file until the next
> checkpoint; this keeps that relfilenode from being used even if the
> OID counter wraps around. If we didn't do that, then imagine that
> while running with wal_level=minimal, we drop an existing relation,
> create a new relation with the same OID, load some data into it, and
> crash, all within the same checkpoint cycle, then we will be able to
> replay the drop, but we will not be able to restore the relation
> contents afterward because at wal_level=minimal they are not logged.
> Apparently, we don't create tombstone files during recovery because we
> know that there will be a checkpoint at the end.

In the example you provided, won't the tombstone file already be present
before the crash?  During recovery, the tombstone file will be removed, and
the new relation wouldn't use the same relfilenode anyway.  I'm probably
missing something obvious here.

I do see the problem if we drop an existing relation, crash, reuse the
filenode, and then crash again (all within the same checkpoint cycle).  The
first recovery would remove the tombstone file, and the second recovery
would wipe out the new relation's files.

> With the existing use of the end-of-recovery record, we always know
> that wal_level>minimal, because we're only using it on standbys. But
> with this use that wouldn't be true any more. So I guess we need to
> start creating tombstone files even during recovery, or else do
> something like what Dilip coded up in
> http://postgr.es/m/CAFiTN-u=r8UTCSzu6_pnihYAtwR1=esq5sRegTEZ2tLa92fovA@mail.gmail.com
> which I think would be a better solution at least in the long term.

IMO this would be good just to reduce the branching a bit.  I suppose
removing the files immediately during recovery might be an optimization in
some cases, but I am skeptical that it really makes that much of a
difference in practice.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: using an end-of-recovery record in all cases

From

Thomas Munro

Date:

20 April 2022, 17:16:20

On Thu, Apr 21, 2022 at 5:02 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> I do see the problem if we drop an existing relation, crash, reuse the
> filenode, and then crash again (all within the same checkpoint cycle).  The
> first recovery would remove the tombstone file, and the second recovery
> would wipe out the new relation's files.

Right, the double-crash case is what I was worrying about.  I'm not
sure, but it might even be more likely than usual that you'll reuse
the same relfilenode after the first crash, because the OID allocator
will start from the same value.