needless complexity in StartupXLOG

From
Robert Haas
Date:
StartupXLOG() has code beginning around line 7900 of xlog.c that
decides, at the end of recovery, between four possible courses of
action. It either writes an end-of-recovery record, or requests a
checkpoint, or creates a checkpoint, or does nothing, depending on the
value of 3 flag variables, and also on whether we're still able to
read the last checkpoint record:

                checkPointLoc = ControlFile->checkPoint;

                /*
                 * Confirm the last checkpoint is available for us to recover
                 * from if we fail.
                 */
                record = ReadCheckpointRecord(xlogreader,
                                              checkPointLoc, 1, false);
                if (record != NULL)
                {
                    promoted = true;

It seems to me that ReadCheckpointRecord() should never fail here. It
should always be true, even when we're not in recovery, that the last
checkpoint record is readable. If there's ever a situation where that
is not true, even for an instant, then a crash at that point will be
unrecoverable. Now, one way that it could happen is if we've got a bug
in the code someplace that removes WAL segments too soon. However, if
we have such a bug, we should fix it. What this code does is say "oh,
I guess we removed the old checkpoint too soon, no big deal, let's
just be more aggressive about getting the next one done," which I do
not think is the right response. Aside from a bug, the only other way
I can see it happening is if someone is manually removing WAL segments
as the server is running through recovery, perhaps as some last-ditch
play to avoid running out of disk space. I don't think the server
needs to have - or should have - code to cater to such crazy
scenarios. Therefore I think that this check, at least in its current
form, is not sensible.
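
For reference, the overall shape of that stanza, heavily condensed and
with the surrounding details elided, is roughly this:

    if (InRecovery)
    {
        if (bgwriterLaunched)
        {
            if (LocalPromoteIsTriggered)
            {
                checkPointLoc = ControlFile->checkPoint;
                record = ReadCheckpointRecord(xlogreader,
                                              checkPointLoc, 1, false);
                if (record != NULL)
                {
                    promoted = true;
                    CreateEndOfRecoveryRecord();
                }
            }
            if (!promoted)
                RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
                                  CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT);
        }
        else
            CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY |
                             CHECKPOINT_IMMEDIATE);
    }
    /* and when we're not in recovery at all, none of this happens */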

My first thought was that we should do the check unconditionally,
rather than just when bgwriterLaunched && LocalPromoteIsTriggered, and
ERROR if it fails. But then I wondered what the point of that would be
exactly. If we have such a bug -- and to the best of my knowledge
there's no evidence that we do -- there's no particular reason it
should only happen at the end of recovery. It could happen any time
the system -- or the user, or malicious space aliens -- remove files
from pg_wal, and we have no real idea about the timing of malicious
space alien activity, so doing the test here rather than anywhere else
just seems like a shot in the dark. Perhaps the most logical place
would be to move it to CreateCheckPoint() just after we remove old
xlog files, but we don't have the xlogreader there, and anyway I don't
see how it's really going to help. What bug do we expect to catch by
removing files we think we don't need and then checking that we didn't
remove the files we think we do need? That seems more like grasping at
straws than a serious attempt to make things work better.

So at the moment I am leaning toward the view that we should just
remove this check entirely, as in the attached, proposed patch.

Really, I think we should consider going further. If it's safe to
write an end-of-recovery record rather than a checkpoint, why not do
so all the time? Right now we fail to do that in the above-described
"impossible" scenario where the previous checkpoint record can't be
read, or if we're exiting archive recovery for some reason other than
a promotion request, or if we're in single-user mode, or if we're in
crash recovery. Presumably, people would like to start up the server
quickly in all of those scenarios, so the only reason not to use this
technology all the time is if we think it's safe in some scenarios and
not others. I can't think of a reason why it matters why we're exiting
archive recovery, nor can I think of a reason why it matters whether
we're in single user mode. The distinction between crash recovery and
archive recovery does seem to matter, but if anything the crash
recovery scenario seems simpler, because then there's only one
timeline involved.

I realize that conservatism may have played a role in this code ending
up looking the way that it does; someone seems to have thought it
would be better not to rely on a new idea in all cases. From my point
of view, though, it's scary to have so many cases, especially cases
that don't seem like they should ever be reached. I think that
simplifying the logic here and trying to do the same things in as many
cases as we can will lead to better robustness. Imagine if instead of
all the hairy logic we have now we just replaced this whole if
(InRecovery) stanza with this:

if (InRecovery)
    CreateEndOfRecoveryRecord();

That would be WAY easier to reason about than the rat's nest we have
here today. Now, I am not sure what it would take to get there, but I
think that is the direction we ought to be heading.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: needless complexity in StartupXLOG

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> So at the moment I am leaning toward the view that we should just
> remove this check entirely, as in the attached, proposed patch.

Haven't dug in deeply but at least following your explanation and
reading over the patch and the code a bit, I tend to agree.

> Really, I think we should consider going further. If it's safe to
> write an end-of-recovery record rather than a checkpoint, why not do
> so all the time? Right now we fail to do that in the above-described
> "impossible" scenario where the previous checkpoint record can't be
> read, or if we're exiting archive recovery for some reason other than
> a promotion request, or if we're in single-user mode, or if we're in
> crash recovery. Presumably, people would like to start up the server
> quickly in all of those scenarios, so the only reason not to use this
> technology all the time is if we think it's safe in some scenarios and
> not others. I can't think of a reason why it matters why we're exiting
> archive recovery, nor can I think of a reason why it matters whether
> we're in single user mode. The distinction between crash recovery and
> archive recovery does seem to matter, but if anything the crash
> recovery scenario seems simpler, because then there's only one
> timeline involved.

Yeah, tend to agree with this too ... but something I find a bit curious
is the comment:

* Insert a special WAL record to mark the end of
* recovery, since we aren't doing a checkpoint.

... immediately after setting promoted = true, and then at the end of
StartupXLOG() having:

if (promoted)
    RequestCheckpoint(CHECKPOINT_FORCE);

maybe I'm missing something, but seems like that comment isn't being
terribly clear.  Perhaps we aren't doing a full checkpoint *there*, but
sure looks like we're going to do one moments later regardless of
anything else since we've set promoted to true..?

> I realize that conservatism may have played a role in this code ending
> up looking the way that it does; someone seems to have thought it
> would be better not to rely on a new idea in all cases. From my point
> of view, though, it's scary to have so many cases, especially cases
> that don't seem like they should ever be reached. I think that
> simplifying the logic here and trying to do the same things in as many
> cases as we can will lead to better robustness. Imagine if instead of
> all the hairy logic we have now we just replaced this whole if
> (InRecovery) stanza with this:
>
> if (InRecovery)
>     CreateEndOfRecoveryRecord();
>
> That would be WAY easier to reason about than the rat's nest we have
> here today. Now, I am not sure what it would take to get there, but I
> think that is the direction we ought to be heading.

Agreed that simpler logic is better, provided it's correct logic, of
course.  Finding better ways to test all of this would be really nice.

Thanks!

Stephen

Attachment

Re: needless complexity in StartupXLOG

From
Robert Haas
Date:
On Mon, Jul 26, 2021 at 1:32 PM Stephen Frost <sfrost@snowman.net> wrote:
> Yeah, tend to agree with this too ... but something I find a bit curious
> is the comment:
>
> * Insert a special WAL record to mark the end of
> * recovery, since we aren't doing a checkpoint.
>
> ... immediately after setting promoted = true, and then at the end of
> StartupXLOG() having:
>
> if (promoted)
>         RequestCheckpoint(CHECKPOINT_FORCE);
>
> maybe I'm missing something, but seems like that comment isn't being
> terribly clear.  Perhaps we aren't doing a full checkpoint *there*, but
> sure looks like we're going to do one moments later regardless of
> anything else since we've set promoted to true..?

Yep. So it's a question of whether we allow operations that might
write WAL in the meantime. When we write the checkpoint record right
here, there can't be any WAL from the new server lifetime until the
checkpoint completes. When we write an end-of-recovery record, there
can. And there could actually be quite a bit, because if we do the
checkpoint right in this section of code, it will be a fast
checkpoint, whereas in the code you quoted above, it's a spread
checkpoint, which takes a lot longer. So the question is whether it's
reasonable to give the checkpoint some time to complete or whether it
needs to be completed right now.
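
Concretely, and I'm paraphrasing the current code from memory here, the
difference comes down to something like this:

    /* end-of-recovery checkpoint: immediate (non-spread), and we wait
     * for it, so no new WAL can be written until it completes */
    RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
                      CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT);

    /* promotion: write the end-of-recovery record now, and only later
     * request a checkpoint without CHECKPOINT_IMMEDIATE, i.e. a spread
     * one that runs while new transactions are already writing WAL */
    if (promoted)
        RequestCheckpoint(CHECKPOINT_FORCE);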

> Agreed that simpler logic is better, provided it's correct logic, of
> course.  Finding better ways to test all of this would be really nice.

Yeah, and there again, it's a lot easier to test if we don't have as
many cases. Now no single change is going to fix that, but the number
of flag variables in xlog.c is simply bonkers. This particular stretch
of code uses 3 of them to even decide whether to attempt the test in
question, and all of those are set in complex ways depending on the
values of still more flag variables. The comments where
bgwriterLaunched is set claim that we only do this during archive
recovery, not crash recovery, but the code depends on the value of
ArchiveRecoveryRequested, not InArchiveRecovery. So I wonder if we
can't get the bgwriter to run even during crash recovery in the
scenario described by:

         * It's possible that archive recovery was requested, but we don't
         * know how far we need to replay the WAL before we reach consistency.
         * This can happen for example if a base backup is taken from a
         * running server using an atomic filesystem snapshot, without calling
         * pg_start/stop_backup. Or if you just kill a running primary server
         * and put it into archive recovery by creating a recovery signal
         * file.

If we ran the bgwriter all the time during crash recovery, we'd know
for sure whether that causes any problems. If we only do it like this
in certain corner cases, it's much more likely that we have bugs.
Grumble, grumble.
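
The bit I'm talking about is roughly this, with the details elided:

    /* in StartupXLOG(), before the main redo loop */
    if (ArchiveRecoveryRequested && IsUnderPostmaster)
    {
        ...     /* signal the postmaster to launch the bgwriter */
        bgwriterLaunched = true;
    }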

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: needless complexity in StartupXLOG

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Jul 26, 2021 at 1:32 PM Stephen Frost <sfrost@snowman.net> wrote:
> > Yeah, tend to agree with this too ... but something I find a bit curious
> > is the comment:
> >
> > * Insert a special WAL record to mark the end of
> > * recovery, since we aren't doing a checkpoint.
> >
> > ... immediately after setting promoted = true, and then at the end of
> > StartupXLOG() having:
> >
> > if (promoted)
> >         RequestCheckpoint(CHECKPOINT_FORCE);
> >
> > maybe I'm missing something, but seems like that comment isn't being
> > terribly clear.  Perhaps we aren't doing a full checkpoint *there*, but
> > sure looks like we're going to do one moments later regardless of
> > anything else since we've set promoted to true..?
>
> Yep. So it's a question of whether we allow operations that might
> write WAL in the meantime. When we write the checkpoint record right
> here, there can't be any WAL from the new server lifetime until the
> checkpoint completes. When we write an end-of-recovery record, there
> can. And there could actually be quite a bit, because if we do the
> checkpoint right in this section of code, it will be a fast
> checkpoint, whereas in the code you quoted above, it's a spread
> checkpoint, which takes a lot longer. So the question is whether it's
> reasonable to give the checkpoint some time to complete or whether it
> needs to be completed right now.

All I was really trying to point out above was that the comment might be
improved upon, just so someone understands that we aren't doing a
checkpoint at this particular place, but one will be done later due to
the promotion.  Maybe I'm being a bit extra with that, but that was my
thought when reading the code and the use of the promoted flag variable.

> > Agreed that simpler logic is better, provided it's correct logic, of
> > course.  Finding better ways to test all of this would be really nice.
>
> Yeah, and there again, it's a lot easier to test if we don't have as
> many cases. Now no single change is going to fix that, but the number
> of flag variables in xlog.c is simply bonkers. This particular stretch
> of code uses 3 of them to even decide whether to attempt the test in
> question, and all of those are set in complex ways depending on the
> values of still more flag variables. The comments where
> bgwriterLaunched is set claim that we only do this during archive
> recovery, not crash recovery, but the code depends on the value of
> ArchiveRecoveryRequested, not InArchiveRecovery. So I wonder if we
> can't get the bgwriter to run even during crash recovery in the
> scenario described by:
>
>          * It's possible that archive recovery was requested, but we don't
>          * know how far we need to replay the WAL before we reach consistency.
>          * This can happen for example if a base backup is taken from a
>          * running server using an atomic filesystem snapshot, without calling
>          * pg_start/stop_backup. Or if you just kill a running primary server
>          * and put it into archive recovery by creating a recovery signal
>          * file.
>
> If we ran the bgwriter all the time during crash recovery, we'd know
> for sure whether that causes any problems. If we only do it like this
> in certain corner cases, it's much more likely that we have bugs.
> Grumble, grumble.

Yeah ... not to mention that it really is just incredibly dangerous to
use such an approach for PITR.  For my 2c, I'd rather we figure out a
way to prevent this than to imply that we support it when we have no way
of knowing if we actually have replayed far enough to be consistent.
That isn't to say that using snapshots for database backups isn't
possible, but it should be done in-between pg_start/stop_backup calls
which properly grab the returned info from those and store the backup
label with the snapshot, etc.

Thanks,

Stephen

Attachment

Re: needless complexity in StartupXLOG

From
Justin Pryzby
Date:
On Mon, Jul 26, 2021 at 03:53:20PM -0400, Robert Haas wrote:
> Yeah, and there again, it's a lot easier to test if we don't have as
> many cases. Now no single change is going to fix that, but the number
> of flag variables in xlog.c is simply bonkers. This particular stretch
> of code uses 3 of them to even decide whether to attempt the test in
> question, and all of those are set in complex ways depending on the
> values of still more flag variables. The comments where
> bgwriterLaunched is set claim that we only do this during archive
> recovery, not crash recovery, but the code depends on the value of
> ArchiveRecoveryRequested, not InArchiveRecovery. So I wonder if we
> can't get the bgwriter to run even during crash recovery in the
> scenario described by:

I'm not following along closely and maybe you're already aware of this one?
https://commitfest.postgresql.org/33/2706/
Background writer and checkpointer in crash recovery

@Thomas:
https://www.postgresql.org/message-id/CA%2BTgmoYmw%3D%3DTOJ6EzYb_vcjyS09NkzrVKSyBKUUyo1zBEaJASA%40mail.gmail.com
>          * It's possible that archive recovery was requested, but we don't
>          * know how far we need to replay the WAL before we reach consistency.
>          * This can happen for example if a base backup is taken from a
>          * running server using an atomic filesystem snapshot, without calling
>          * pg_start/stop_backup. Or if you just kill a running primary server
>          * and put it into archive recovery by creating a recovery signal
>          * file.
> 
> If we ran the bgwriter all the time during crash recovery, we'd know
> for sure whether that causes any problems. If we only do it like this
> in certain corner cases, it's much more likely that we have bugs.
> Grumble, grumble.

-- 
Justin



Re: needless complexity in StartupXLOG

From
Robert Haas
Date:
On Mon, Jul 26, 2021 at 4:15 PM Stephen Frost <sfrost@snowman.net> wrote:
> All I was really trying to point out above was that the comment might be
> improved upon, just so someone understands that we aren't doing a
> checkpoint at this particular place, but one will be done later due to
> the promotion.  Maybe I'm being a bit extra with that, but that was my
> thought when reading the code and the use of the promoted flag variable.

Yeah, I agree, it confused me too, at first.

> Yeah ... not to mention that it really is just incredibly dangerous to
> use such an approach for PITR.  For my 2c, I'd rather we figure out a
> way to prevent this than to imply that we support it when we have no way
> of knowing if we actually have replayed far enough to be consistent.
> That isn't to say that using snapshots for database backups isn't
> possible, but it should be done in-between pg_start/stop_backup calls
> which properly grab the returned info from those and store the backup
> label with the snapshot, etc.

My position on that is that I would not particularly recommend the
technique described here, but I would not choose to try to block it
either. That's an argument for another thread, though.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: needless complexity in StartupXLOG

From
Kyotaro Horiguchi
Date:
At Mon, 26 Jul 2021 16:15:23 -0400, Stephen Frost <sfrost@snowman.net> wrote in 
> Greetings,
> 
> * Robert Haas (robertmhaas@gmail.com) wrote:
> > On Mon, Jul 26, 2021 at 1:32 PM Stephen Frost <sfrost@snowman.net> wrote:
> > > Yeah, tend to agree with this too ... but something I find a bit curious
> > > is the comment:
> > >
> > > * Insert a special WAL record to mark the end of
> > > * recovery, since we aren't doing a checkpoint.
> > >
> > > ... immediately after setting promoted = true, and then at the end of
> > > StartupXLOG() having:
> > >
> > > if (promoted)
> > >         RequestCheckpoint(CHECKPOINT_FORCE);
> > >
> > > maybe I'm missing something, but seems like that comment isn't being
> > > terribly clear.  Perhaps we aren't doing a full checkpoint *there*, but
> > > sure looks like we're going to do one moments later regardless of
> > > anything else since we've set promoted to true..?
> > 
> > Yep. So it's a question of whether we allow operations that might
> > write WAL in the meantime. When we write the checkpoint record right
> > here, there can't be any WAL from the new server lifetime until the
> > checkpoint completes. When we write an end-of-recovery record, there
> > can. And there could actually be quite a bit, because if we do the
> > checkpoint right in this section of code, it will be a fast
> > checkpoint, whereas in the code you quoted above, it's a spread
> > checkpoint, which takes a lot longer. So the question is whether it's
> > reasonable to give the checkpoint some time to complete or whether it
> > needs to be completed right now.
> 
> All I was really trying to point out above was that the comment might be
> improved upon, just so someone understands that we aren't doing a
> checkpoint at this particular place, but one will be done later due to
> the promotion.  Maybe I'm being a bit extra with that, but that was my
> thought when reading the code and the use of the promoted flag variable.

(I also feel we don't need the last-checkpoint check.)

FWIW, by the way, I complained that the variable name "promoted" is a
bit confusing.  Its old name was fast_promoted, which meant that fast
promotion was being *requested* and ongoing.  On the other hand the
current name "promoted" still means "(fast=non-fallback) promotion is
ongoing", so there was a conversation as follows.

https://www.postgresql.org/message-id/9fdd994d-a531-a52b-7906-e1cc22701310%40oss.nttdata.com

>> How about changing it to fallback_promotion, or some names with more
>> behavior-specific name like immediate_checkpoint_needed?
> 
> I like doEndOfRecoveryCkpt or something, but I have no strong opinion
> about that flag naming. So I'm ok with immediate_checkpoint_needed
> if others also like that, too.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: needless complexity in StartupXLOG

From
Robert Haas
Date:
On Tue, Jul 27, 2021 at 9:18 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> FWIW, by the way, I complained that the variable name "promoted" is a
> bit confusing.  Its old name was fast_promoted, which meant that fast
> promotion was being *requested* and ongoing.  On the other hand the
> current name "promoted" still means "(fast=non-fallback) promotion is
> ongoing", so there was a conversation as follows.
>
> https://www.postgresql.org/message-id/9fdd994d-a531-a52b-7906-e1cc22701310%40oss.nttdata.com

I agree - that variable name is also not great. I am open to making
improvements in that area and in others that have been suggested on
this thread, but my immediate goal is to figure out whether anyone
objects to me committing the posted patch. If nobody comes up with a
reason why it's a bad idea in the next few days, I'll plan to move
ahead with it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: needless complexity in StartupXLOG

From
Kyotaro Horiguchi
Date:
At Tue, 27 Jul 2021 11:03:14 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> On Tue, Jul 27, 2021 at 9:18 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > FWIW, by the way, I complained that the variable name "promoted" is a
> > bit confusing.  Its old name was fast_promoted, which meant that fast
> > promotion was being *requested* and ongoing.  On the other hand the
> > current name "promoted" still means "(fast=non-fallback) promotion is
> > ongoing", so there was a conversation as follows.
> >
> > https://www.postgresql.org/message-id/9fdd994d-a531-a52b-7906-e1cc22701310%40oss.nttdata.com
> 
> I agree - that variable name is also not great. I am open to making
> improvements in that area and in others that have been suggested on
> this thread, but my immediate goal is to figure out whether anyone
> objects to me committing the posted patch. If nobody comes up with a
> reason why it's a bad idea in the next few days, I'll plan to move
> ahead with it.

That's fine with me.

I still haven't found a way to lose the last checkpoint due to software
failure.  Repeated promotion without having new checkpoints is safe as
expected. We don't remove WAL files unless a checkpoint completes, and
a checkpoint preserves segments back to the one containing its redo
point.
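
The removal path I checked is the one in CreateCheckPoint(), roughly
the following, which never reaches past the segment holding the redo
pointer of the checkpoint just written:

    XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
    KeepLogSeg(recptr, &_logSegNo);
    _logSegNo--;
    RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);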

In short, I'm for it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: needless complexity in StartupXLOG

From
Andres Freund
Date:
Hi,

On 2021-07-26 12:12:53 -0400, Robert Haas wrote:
> My first thought was that we should do the check unconditionally,
> rather than just when bgwriterLaunched && LocalPromoteIsTriggered, and
> ERROR if it fails. But then I wondered what the point of that would be
> exactly. If we have such a bug -- and to the best of my knowledge
> there's no evidence that we do -- there's no particular reason it
> should only happen at the end of recovery. It could happen any time
> the system -- or the user, or malicious space aliens -- remove files
> from pg_wal, and we have no real idea about the timing of malicious
> space alien activity, so doing the test here rather than anywhere else
> just seems like a shot in the dark.

Yea. The history of that code being added doesn't suggest that there was
a concrete issue being addressed, from what I can tell.


> So at the moment I am leaning toward the view that we should just
> remove this check entirely, as in the attached, proposed patch.

+1


> Really, I think we should consider going further. If it's safe to
> write an end-of-recovery record rather than a checkpoint, why not do
> so all the time?

+many. The current split doesn't make much sense. For one, it often is a huge
issue if crash recovery takes a long time - why should we incur the cost that
we are OK avoiding during promotions? For another, end-of-recovery is a
crucial path for correctness, reducing the number of non-trivial paths is
good.


> Imagine if instead of
> all the hairy logic we have now we just replaced this whole if
> (InRecovery) stanza with this:
> 
> if (InRecovery)
>     CreateEndOfRecoveryRecord();
> 
> That would be WAY easier to reason about than the rat's nest we have
> here today. Now, I am not sure what it would take to get there, but I
> think that is the direction we ought to be heading.

What are we going to do in the single user ([1]) case in this awesome future?
I guess we could just not create a checkpoint until single user mode is shut
down / creates a checkpoint for other reasons?


Greetings,

Andres Freund


[1] I really wish somebody had the energy to just remove single user and
bootstrap modes. The degree to which they increase complexity in the rest of
the system is entirely unreasonable. There's not actually any reason
bootstrapping can't happen with checkpointer et al running, it's just
autovacuum that'd need to be blocked.



Re: needless complexity in StartupXLOG

From
Julien Rouhaud
Date:
On Thu, Jul 29, 2021 at 3:28 AM Andres Freund <andres@anarazel.de> wrote:
>
> [1] I really wish somebody had the energy to just remove single user and
> bootstrap modes. The degree to which they increase complexity in the rest of
> the system is entirely unreasonable. There's not actually any reason
> bootstrapping can't happen with checkpointer et al running, it's just
> autovacuum that'd need to be blocked.

Any objection to adding an entry for that in the wiki TODO?



Re: needless complexity in StartupXLOG

From
Amul Sul
Date:
On Mon, Jul 26, 2021 at 9:43 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> StartupXLOG() has code beginning around line 7900 of xlog.c that
> decides, at the end of recovery, between four possible courses of
> action. It either writes an end-of-recovery record, or requests a
> checkpoint, or creates a checkpoint, or does nothing, depending on the
> value of 3 flag variables, and also on whether we're still able to
> read the last checkpoint record:
>
>                 checkPointLoc = ControlFile->checkPoint;
>
>                 /*
>                  * Confirm the last checkpoint is available for us to recover
>                  * from if we fail.
>                  */
>                 record = ReadCheckpointRecord(xlogreader,
>                                               checkPointLoc, 1, false);
>                 if (record != NULL)
>                 {
>                     promoted = true;
>
> It seems to me that ReadCheckpointRecord() should never fail here. It
> should always be true, even when we're not in recovery, that the last
> checkpoint record is readable. If there's ever a situation where that
> is not true, even for an instant, then a crash at that point will be
> unrecoverable. Now, one way that it could happen is if we've got a bug
> in the code someplace that removes WAL segments too soon. However, if
> we have such a bug, we should fix it. What this code does is say "oh,
> I guess we removed the old checkpoint too soon, no big deal, let's
> just be more aggressive about getting the next one done," which I do
> not think is the right response. Aside from a bug, the only other way
> I can see it happening is if someone is manually removing WAL segments
> as the server is running through recovery, perhaps as some last-ditch
> play to avoid running out of disk space. I don't think the server
> needs to have - or should have - code to cater to such crazy
> scenarios. Therefore I think that this check, at least in its current
> form, is not sensible.
>
> My first thought was that we should do the check unconditionally,
> rather than just when bgwriterLaunched && LocalPromoteIsTriggered, and
> ERROR if it fails. But then I wondered what the point of that would be
> exactly. If we have such a bug -- and to the best of my knowledge
> there's no evidence that we do -- there's no particular reason it
> should only happen at the end of recovery. It could happen any time
> the system -- or the user, or malicious space aliens -- remove files
> from pg_wal, and we have no real idea about the timing of malicious
> space alien activity, so doing the test here rather than anywhere else
> just seems like a shot in the dark. Perhaps the most logical place
> would be to move it to CreateCheckPoint() just after we remove old
> xlog files, but we don't have the xlogreader there, and anyway I don't
> see how it's really going to help. What bug do we expect to catch by
> removing files we think we don't need and then checking that we didn't
> remove the files we think we do need? That seems more like grasping at
> straws than a serious attempt to make things work better.
>
> So at the moment I am leaning toward the view that we should just
> remove this check entirely, as in the attached, proposed patch.
>

Can we have an elog() fatal error or warning to make sure that the
last checkpoint is still readable? It's possible that the user
(knowingly or unknowingly) or some buggy code has removed the WAL file
containing the last checkpoint, and if that has happened we would have
a hard time tracking down the cause when we later hit unexpected
behavior. Thoughts?

> Really, I think we should consider going further. If it's safe to
> write an end-of-recovery record rather than a checkpoint, why not do
> so all the time? Right now we fail to do that in the above-described
> "impossible" scenario where the previous checkpoint record can't be
> read, or if we're exiting archive recovery for some reason other than
> a promotion request, or if we're in single-user mode, or if we're in
> crash recovery. Presumably, people would like to start up the server
> quickly in all of those scenarios, so the only reason not to use this
> technology all the time is if we think it's safe in some scenarios and
> not others. I can't think of a reason why it matters why we're exiting
> archive recovery, nor can I think of a reason why it matters whether
> we're in single user mode. The distinction between crash recovery and
> archive recovery does seem to matter, but if anything the crash
> recovery scenario seems simpler, because then there's only one
> timeline involved.
>
> I realize that conservatism may have played a role in this code ending
> up looking the way that it does; someone seems to have thought it
> would be better not to rely on a new idea in all cases. From my point
> of view, though, it's scary to have so many cases, especially cases
> that don't seem like they should ever be reached. I think that
> simplifying the logic here and trying to do the same things in as many
> cases as we can will lead to better robustness. Imagine if instead of
> all the hairy logic we have now we just replaced this whole if
> (InRecovery) stanza with this:
>
> if (InRecovery)
>     CreateEndOfRecoveryRecord();

+1, and do the checkpoint at the end unconditionally as we are doing
for the promotion.

Regards,
Amul



Re: needless complexity in StartupXLOG

From
Robert Haas
Date:
On Wed, Jul 28, 2021 at 3:28 PM Andres Freund <andres@anarazel.de> wrote:
> > Imagine if instead of
> > all the hairy logic we have now we just replaced this whole if
> > (InRecovery) stanza with this:
> >
> > if (InRecovery)
> >     CreateEndOfRecoveryRecord();
> >
> > That would be WAY easier to reason about than the rat's nest we have
> > here today. Now, I am not sure what it would take to get there, but I
> > think that is the direction we ought to be heading.
>
> What are we going to do in the single user ([1]) case in this awesome future?
> I guess we could just not create a checkpoint until single user mode is shut
> down / creates a checkpoint for other reasons?

It probably depends on who writes this thus-far-hypothetical patch,
but my thought is that we'd unconditionally request a checkpoint after
writing the end-of-recovery record, same as we do now if (promoted).
If we happened to be in single-user mode, then that checkpoint request
would turn into performing a checkpoint on the spot in the one and
only process we've got, but StartupXLOG() wouldn't really need to care
what happens under the hood after it called RequestCheckpoint().
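
That is more or less what RequestCheckpoint() already does today when
there is no checkpointer process to ask; roughly:

    /* in RequestCheckpoint() */
    if (!IsUnderPostmaster)
    {
        /* standalone backend: do the checkpoint ourselves, immediately */
        CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
        ...
        return;
    }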

> [1] I really wish somebody had the energy to just remove single user and
> bootstrap modes. The degree to which they increase complexity in the rest of
> the system is entirely unreasonable. There's not actually any reason
> bootstrapping can't happen with checkpointer et al running, it's just
> autovacuum that'd need to be blocked.

I don't know whether this is the way forward or not. I think a lot of
the complexity of the current system is incidental rather than
intrinsic. If I were going to work on this, I'd probably working on
trying to tidy up the code and reduce the number of places that need
to care about IsUnderPostmaster and IsPostmasterEnvironment, rather
than trying to get rid of them. I suspect that's a necessary
prerequisite step anyway, and not a trivial effort either.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: needless complexity in StartupXLOG

From
Robert Haas
Date:
On Thu, Jul 29, 2021 at 1:47 AM Amul Sul <sulamul@gmail.com> wrote:
> Can we have an elog() fatal error or warning to make sure that the
> last checkpoint is still readable? It's possible that the user
> (knowingly or unknowingly) or some buggy code has removed the WAL file
> containing the last checkpoint, and if that has happened we would have
> a hard time tracking down the cause when we later hit unexpected
> behavior. Thoughts?

Sure, we could, but I don't think we should. Such crazy things can
happen any time, not just at the point where this check is happening.
It's not particularly more likely to happen here vs. any other place
where we could insert a check. Should we check everywhere, all the
time, just in case?

> > I realize that conservatism may have played a role in this code ending
> > up looking the way that it does; someone seems to have thought it
> > would be better not to rely on a new idea in all cases. From my point
> > of view, though, it's scary to have so many cases, especially cases
> > that don't seem like they should ever be reached. I think that
> > simplifying the logic here and trying to do the same things in as many
> > cases as we can will lead to better robustness. Imagine if instead of
> > all the hairy logic we have now we just replaced this whole if
> > (InRecovery) stanza with this:
> >
> > if (InRecovery)
> >     CreateEndOfRecoveryRecord();
>
> +1, and do the checkpoint at the end unconditionally as we are doing
> for the promotion.

Yeah, that was my thought, too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: needless complexity in StartupXLOG

From
Andres Freund
Date:
Hi,

On 2021-07-29 12:49:19 +0800, Julien Rouhaud wrote:
> On Thu, Jul 29, 2021 at 3:28 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > [1] I really wish somebody had the energy to just remove single user and
> > bootstrap modes. The degree to which they increase complexity in the rest of
> > the system is entirely unreasonable. There's not actually any reason
> > bootstrapping can't happen with checkpointer et al running, it's just
> > autovacuum that'd need to be blocked.
> 
> Any objection to adding an entry for that in the wiki TODO?

Not sure there's enough consensus on the idea for that. I personally
think that's a good approach at reducing relevant complexity, but I
don't know if anybody agrees...

Greetings,

Andres Freund



Re: needless complexity in StartupXLOG

From
Robert Haas
Date:
On Thu, Jul 29, 2021 at 3:16 PM Andres Freund <andres@anarazel.de> wrote:
> Not sure there's enough consensus on the idea for that. I personally
> think that's a good approach at reducing relevant complexity, but I
> don't know if anybody agrees...

There does seem to be agreement on the proposed patch, so I have committed it.

Thanks to all for the discussion and the links to other relevant threads.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



using an end-of-recovery record in all cases

From
Robert Haas
Date:
On Thu, Jul 29, 2021 at 11:49 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jul 28, 2021 at 3:28 PM Andres Freund <andres@anarazel.de> wrote:
> > > Imagine if instead of
> > > all the hairy logic we have now we just replaced this whole if
> > > (InRecovery) stanza with this:
> > >
> > > if (InRecovery)
> > >     CreateEndOfRecoveryRecord();
> > >
> > > That would be WAY easier to reason about than the rat's nest we have
> > > here today. Now, I am not sure what it would take to get there, but I
> > > think that is the direction we ought to be heading.
> >
> > What are we going to do in the single user ([1]) case in this awesome future?
> > I guess we could just not create a checkpoint until single user mode is shut
> > down / creates a checkpoint for other reasons?
>
> It probably depends on who writes this thus-far-hypothetical patch,
> but my thought is that we'd unconditionally request a checkpoint after
> writing the end-of-recovery record, same as we do now if (promoted).
> If we happened to be in single-user mode, then that checkpoint request
> would turn into performing a checkpoint on the spot in the one and
> only process we've got, but StartupXLOG() wouldn't really need to care
> what happens under the hood after it called RequestCheckpoint().

I decided to try writing a patch to use an end-of-recovery record
rather than a checkpoint record in all cases. The patch itself was
pretty simple but didn't work. There are two different reasons why it
didn't work, which I'll explain in a minute. I'm not sure whether
there are any other problems; these are the only two things that cause
problems with 'make check-world', but that's hardly a guarantee of
anything. Anyway, I thought it would be useful to report these issues
first and hopefully get some feedback.

The first problem I hit was that GetRunningTransactionData() does
Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid)).
While that does seem like a superficially reasonable thing to assert,
StartupXLOG() initializes latestCompletedXid by calling
TransactionIdRetreat on the value extracted from checkPoint.nextXid,
and the first-ever checkpoint that is written at the beginning of WAL
has a nextXid of 3, so when starting up from that checkpoint nextXid
is 2, which is not a normal XID. When we try to create the next
checkpoint, CreateCheckPoint() does LogStandbySnapshot() which calls
GetRunningTransactionData() and the assertion fails. In the current
code, we avoid this more or less accidentally because
LogStandbySnapshot() is skipped when starting from a shutdown
checkpoint. If we write an end-of-recovery record and then trigger a
checkpoint afterwards, then we don't avoid the problem. Although I'm
impishly tempted to suggest that we #define SecondNormalTransactionId
4 and then use that to create the first checkpoint instead of
FirstNormalTransactionId -- naturally with no comments explaining why
we're doing it -- I think the real problem is that the assertion is
just wrong. CurrentRunningXacts->latestCompletedXid shouldn't be
InvalidTransactionId or BootstrapTransactionId, but
FrozenTransactionId is a legal, if corner-case, value, at least as the
code stands today.
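
For anyone following along at home, the special XIDs in question are
defined in transam.h:

    #define InvalidTransactionId        ((TransactionId) 0)
    #define BootstrapTransactionId      ((TransactionId) 1)
    #define FrozenTransactionId         ((TransactionId) 2)
    #define FirstNormalTransactionId    ((TransactionId) 3)

    #define TransactionIdIsNormal(xid)  ((xid) >= FirstNormalTransactionId)

So retreating from the initial nextXid of 3 lands exactly on
FrozenTransactionId, the one special value this code path can actually
produce.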

Unfortunately we can't just relax the assertion, because the
XLOG_RUNNING_XACTS record will eventually be handed to
ProcArrayApplyRecoveryInfo() for processing ... and that function
contains a matching assertion which would in turn fail. It in turn
passes the value to MaintainLatestCompletedXidRecovery() which
contains yet another matching assertion, so the restriction to normal
XIDs here looks pretty deliberate. There are no comments, though, so
the reader is left to guess why. I see one problem:
MaintainLatestCompletedXidRecovery uses FullXidRelativeTo, which
expects a normal XID. Perhaps it's best to just dodge the entire issue
by skipping LogStandbySnapshot() if latestCompletedXid happens to be
2, but that feels like a hack, because AFAICS the real problem is that
StartupXLog() doesn't agree with the rest of the code on whether 2 is
a legal case, and maybe we ought to be storing a value that doesn't
need to be computed via TransactionIdRetreat().

The second problem I hit was a preexisting bug where, under
wal_level=minimal, redo of a "create tablespace" record can blow away
data of which we have no other copy. See
http://postgr.es/m/CA+TgmoaLO9ncuwvr2nN-J4VEP5XyAcy=zKiHxQzBbFRxxGxm0w@mail.gmail.com
for details. That bug happens to make src/test/recovery's
018_wal_optimize.pl fail one of the tests, because the test starts and
stops the server repeatedly, with the result that with the attached
patch, we just keep writing end-of-recovery records and never getting
time to finish a checkpoint before the next shutdown, so every test
replays the CREATE TABLESPACE record and everything that previous
tests did. The "wal_level = minimal, SET TABLESPACE commit
subtransaction" fails because it's the only one that (1) uses the
tablespace for a new table, (2) commits, and (3) runs before a
checkpoint is manually forced.

It's also worth noting that if we go this way,
CHECKPOINT_END_OF_RECOVERY should be ripped out entirely. We'd still
be triggering a checkpoint at the end of recovery, but because it
could be running concurrently with WAL-generating activity, it
wouldn't be an end-of-recovery checkpoint in the sense that we now use
that term. In particular, you couldn't assume that no other write
transactions are running at the point when this checkpoint is
performed. I haven't yet tried ripping that out and doing so might
reveal other problems.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: using an end-of-recovery record in all cases

From
Robert Haas
Date:
On Mon, Aug 9, 2021 at 3:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I decided to try writing a patch to use an end-of-recovery record
> rather than a checkpoint record in all cases.
>
> The first problem I hit was that GetRunningTransactionData() does
> Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid)).
>
> Unfortunately we can't just relax the assertion, because the
> XLOG_RUNNING_XACTS record will eventually be handed to
> ProcArrayApplyRecoveryInfo() for processing ... and that function
> contains a matching assertion which would in turn fail. It in turn
> passes the value to MaintainLatestCompletedXidRecovery() which
> contains yet another matching assertion, so the restriction to normal
> XIDs here looks pretty deliberate. There are no comments, though, so
> the reader is left to guess why. I see one problem:
> MaintainLatestCompletedXidRecovery uses FullXidRelativeTo, which
> expects a normal XID. Perhaps it's best to just dodge the entire issue
> by skipping LogStandbySnapshot() if latestCompletedXid happens to be
> 2, but that feels like a hack, because AFAICS the real problem is that
> StartupXLOG() doesn't agree with the rest of the code on whether 2 is
> a legal case, and maybe we ought to be storing a value that doesn't
> need to be computed via TransactionIdRetreat().

Anyone have any thoughts about this?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: using an end-of-recovery record in all cases

From
Kyotaro Horiguchi
Date:
At Thu, 2 Sep 2021 11:30:59 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> On Mon, Aug 9, 2021 at 3:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I decided to try writing a patch to use an end-of-recovery record
> > rather than a checkpoint record in all cases.
> >
> > The first problem I hit was that GetRunningTransactionData() does
> > Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid)).
> >
> > Unfortunately we can't just relax the assertion, because the
> > XLOG_RUNNING_XACTS record will eventually be handed to
> > ProcArrayApplyRecoveryInfo() for processing ... and that function
> > contains a matching assertion which would in turn fail. It in turn
> > passes the value to MaintainLatestCompletedXidRecovery() which
> > contains yet another matching assertion, so the restriction to normal
> > XIDs here looks pretty deliberate. There are no comments, though, so
> > the reader is left to guess why. I see one problem:
> > MaintainLatestCompletedXidRecovery uses FullXidRelativeTo, which
> > expects a normal XID. Perhaps it's best to just dodge the entire issue
> > by skipping LogStandbySnapshot() if latestCompletedXid happens to be
> > 2, but that feels like a hack, because AFAICS the real problem is that
> > StartupXLOG() doesn't agree with the rest of the code on whether 2 is
> > a legal case, and maybe we ought to be storing a value that doesn't
> > need to be computed via TransactionIdRetreat().
> 
> Anyone have any thoughts about this?

I tried to reproduce this but just replacing the end-of-recovery
checkpoint request with issuing an end-of-recovery record didn't cause
make check-world to fail for me.  Do you have an idea of any other
requirement to cause that?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: using an end-of-recovery record in all cases

From
Amul Sul
Date:
On Fri, Sep 3, 2021 at 10:23 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Thu, 2 Sep 2021 11:30:59 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
> > On Mon, Aug 9, 2021 at 3:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > > I decided to try writing a patch to use an end-of-recovery record
> > > rather than a checkpoint record in all cases.
> > >
> > > The first problem I hit was that GetRunningTransactionData() does
> > > Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid)).
> > >
> > > Unfortunately we can't just relax the assertion, because the
> > > XLOG_RUNNING_XACTS record will eventually be handed to
> > > ProcArrayApplyRecoveryInfo() for processing ... and that function
> > > contains a matching assertion which would in turn fail. It in turn
> > > passes the value to MaintainLatestCompletedXidRecovery() which
> > > contains yet another matching assertion, so the restriction to normal
> > > XIDs here looks pretty deliberate. There are no comments, though, so
> > > the reader is left to guess why. I see one problem:
> > > MaintainLatestCompletedXidRecovery uses FullXidRelativeTo, which
> > > expects a normal XID. Perhaps it's best to just dodge the entire issue
> > > by skipping LogStandbySnapshot() if latestCompletedXid happens to be
> > > 2, but that feels like a hack, because AFAICS the real problem is that
> > > StartupXLOG() doesn't agree with the rest of the code on whether 2 is
> > > a legal case, and maybe we ought to be storing a value that doesn't
> > > need to be computed via TransactionIdRetreat().
> >
> > Anyone have any thoughts about this?
>
> I tried to reproduce this but just replacing the end-of-recovery
> checkpoint request with issuing an end-of-recovery record didn't cause
> make check-world to fail for me.  Do you have an idea of any other
> requirement to cause that?
>

You might need the following change at the end of StartupXLOG():

-   if (promoted)
-       RequestCheckpoint(CHECKPOINT_FORCE);
+   RequestCheckpoint(CHECKPOINT_FORCE);

Regards,
Amul



Re: using an end-of-recovery record in all cases

From
Robert Haas
Date:
On Fri, Sep 3, 2021 at 12:52 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> I tried to reproduce this but just replacing the end-of-recovery
> checkpoint request with issuing an end-of-recovery record didn't cause
> make check-world to fail for me.  Do you have an idea of any other
> requirement to cause that?

Did you use the patch I posted, or a different one?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: using an end-of-recovery record in all cases

From
Amit Kapila
Date:
On Tue, Aug 10, 2021 at 12:31 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jul 29, 2021 at 11:49 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Wed, Jul 28, 2021 at 3:28 PM Andres Freund <andres@anarazel.de> wrote:
> > > > Imagine if instead of
> > > > all the hairy logic we have now we just replaced this whole if
> > > > (InRecovery) stanza with this:
> > > >
> > > > if (InRecovery)
> > > >     CreateEndOfRecoveryRecord();
> > > >
> > > > That would be WAY easier to reason about than the rat's nest we have
> > > > here today. Now, I am not sure what it would take to get there, but I
> > > > think that is the direction we ought to be heading.
> > >
> > > What are we going to do in the single user ([1]) case in this awesome future?
> > > I guess we could just not create a checkpoint until single user mode is shut
> > > down / creates a checkpoint for other reasons?
> >
> > It probably depends on who writes this thus-far-hypothetical patch,
> > but my thought is that we'd unconditionally request a checkpoint after
> > writing the end-of-recovery record, same as we do now if (promoted).
> > If we happened to be in single-user mode, then that checkpoint request
> > would turn into performing a checkpoint on the spot in the one and
> > only process we've got, but StartupXLOG() wouldn't really need to care
> > what happens under the hood after it called RequestCheckpoint().
>
> I decided to try writing a patch to use an end-of-recovery record
> rather than a checkpoint record in all cases. The patch itself was
> pretty simple but didn't work. There are two different reasons why it
> didn't work, which I'll explain in a minute. I'm not sure whether
> there are any other problems; these are the only two things that cause
> problems with 'make check-world', but that's hardly a guarantee of
> anything. Anyway, I thought it would be useful to report these issues
> first and hopefully get some feedback.
>
> The first problem I hit was that GetRunningTransactionData() does
> Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid)).
> While that does seem like a superficially reasonable thing to assert,
> StartupXLOG() initializes latestCompletedXid by calling
> TransactionIdRetreat on the value extracted from checkPoint.nextXid,
> and the first-ever checkpoint that is written at the beginning of WAL
> has a nextXid of 3, so when starting up from that checkpoint
> latestCompletedXid ends up as 2, which is not a normal XID. When we
> try to create the next
> checkpoint, CreateCheckPoint() does LogStandbySnapshot() which calls
> GetRunningTransactionData() and the assertion fails. In the current
> code, we avoid this more or less accidentally because
> LogStandbySnapshot() is skipped when starting from a shutdown
> checkpoint. If we write an end-of-recovery record and then trigger a
> checkpoint afterwards, then we don't avoid the problem. Although I'm
> impishly tempted to suggest that we #define SecondNormalTransactionId
> 4 and then use that to create the first checkpoint instead of
> FirstNormalTransactionId -- naturally with no comments explaining why
> we're doing it -- I think the real problem is that the assertion is
> just wrong. CurrentRunningXacts->latestCompletedXid shouldn't be
> InvalidTransactionId or BootstrapTransactionId, but
> FrozenTransactionId is a legal, if corner-case, value, at least as the
> code stands today.
>
> Unfortunately we can't just relax the assertion, because the
> XLOG_RUNNING_XACTS record will eventually be handed to
> ProcArrayApplyRecoveryInfo() for processing ... and that function
> contains a matching assertion which would in turn fail. It in turn
> passes the value to MaintainLatestCompletedXidRecovery() which
> contains yet another matching assertion, so the restriction to normal
> XIDs here looks pretty deliberate. There are no comments, though, so
> the reader is left to guess why. I see one problem:
> MaintainLatestCompletedXidRecovery uses FullXidRelativeTo, which
> expects a normal XID. Perhaps it's best to just dodge the entire issue
> by skipping LogStandbySnapshot() if latestCompletedXid happens to be
> 2,
>

From the above explanation, it seems better to skip
LogStandbySnapshot() for this proposal. Can't we consider skipping
LogStandbySnapshot() by passing some explicit flag like
'recovery_checkpoint' or something like that?
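
As a sketch, where the flag name is purely for illustration:

    /* in CreateCheckPoint(); 'recovery_checkpoint' is hypothetical */
    if (!shutdown && !recovery_checkpoint && XLogStandbyInfoActive())
        LogStandbySnapshot();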

I think there is some prior discussion related to this work in
another thread, but it would be easier to follow if you could
summarize it here.


With Regards,
Amit Kapila.