Thread: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

[HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Masahiko Sawada
Date:
Hi,

Since an optional second argument wait_for_archive of pg_stop_backup
has been  introduced in PostgreSQL 10 we can choose whether wait for
archiving. But my colleagues found that we can do pg_stop_backup with
wait_for_archive = true on the standby server but it actually doesn't
wait for WAL archiving. Because this behavior is not documented and we
cannot find out it without reading source code it will confuse the
user.

I think we can raise an error when pg_stop_backup with
wait_for_archive = true is executed on the standby. Attached patch
change it so that.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Magnus Hagander
Date:


On Thu, Jun 22, 2017 at 10:12 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Hi,

Since an optional second argument wait_for_archive of pg_stop_backup
has been  introduced in PostgreSQL 10 we can choose whether wait for
archiving. But my colleagues found that we can do pg_stop_backup with
wait_for_archive = true on the standby server but it actually doesn't
wait for WAL archiving. Because this behavior is not documented and we
cannot find out it without reading source code it will confuse the
user.

I think we can raise an error when pg_stop_backup with
wait_for_archive = true is executed on the standby. Attached patch
change it so that.

Wouldn't it be better to make it *work*? If you have archive_mode=always, it makes sense to want to wait on the standby as well, does it not?
 
--

Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Masahiko Sawada
Date:
On Thu, Jun 22, 2017 at 10:36 PM, Magnus Hagander <magnus@hagander.net> wrote:
>
>
> On Thu, Jun 22, 2017 at 10:12 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>> Hi,
>>
>> Since an optional second argument wait_for_archive of pg_stop_backup
>> has been  introduced in PostgreSQL 10 we can choose whether wait for
>> archiving. But my colleagues found that we can do pg_stop_backup with
>> wait_for_archive = true on the standby server but it actually doesn't
>> wait for WAL archiving. Because this behavior is not documented and we
>> cannot find out it without reading source code it will confuse the
>> user.
>>
>> I think we can raise an error when pg_stop_backup with
>> wait_for_archive = true is executed on the standby. Attached patch
>> change it so that.
>
>
> Wouldn't it be better to make it *work*? If you have archive_mode=always, it
> makes sense to want to wait on the standby as well, does it not?
>

Yes, ideally it will be better to make it wait for WAL archiving on
standby server when archive_mode=always. But I think it would be for
PG11 item, and this item is for PG10.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Magnus Hagander
Date:
On Thu, Jun 22, 2017 at 6:22 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Jun 22, 2017 at 10:36 PM, Magnus Hagander <magnus@hagander.net> wrote:
>
>
> On Thu, Jun 22, 2017 at 10:12 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>> Hi,
>>
>> Since an optional second argument wait_for_archive of pg_stop_backup
>> has been  introduced in PostgreSQL 10 we can choose whether wait for
>> archiving. But my colleagues found that we can do pg_stop_backup with
>> wait_for_archive = true on the standby server but it actually doesn't
>> wait for WAL archiving. Because this behavior is not documented and we
>> cannot find out it without reading source code it will confuse the
>> user.
>>
>> I think we can raise an error when pg_stop_backup with
>> wait_for_archive = true is executed on the standby. Attached patch
>> change it so that.
>
>
> Wouldn't it be better to make it *work*? If you have archive_mode=always, it
> makes sense to want to wait on the standby as well, does it not?
>

Yes, ideally it will be better to make it wait for WAL archiving on
standby server when archive_mode=always. But I think it would be for
PG11 item, and this item is for PG10.


I'm not sure. I think this can be considered a bug in the implementation for 10, and as such is "open for fixing". However, it's not a very critical bug so I doubt it should be a release blocker, but if someone wants to work on a fix I think we should commit it. 

--

Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Masahiko Sawada
Date:
On Thu, Jun 29, 2017 at 10:30 PM, Magnus Hagander <magnus@hagander.net> wrote:
> On Thu, Jun 22, 2017 at 6:22 PM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>> On Thu, Jun 22, 2017 at 10:36 PM, Magnus Hagander <magnus@hagander.net>
>> wrote:
>> >
>> >
>> > On Thu, Jun 22, 2017 at 10:12 AM, Masahiko Sawada
>> > <sawada.mshk@gmail.com>
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> Since an optional second argument wait_for_archive of pg_stop_backup
>> >> has been  introduced in PostgreSQL 10 we can choose whether wait for
>> >> archiving. But my colleagues found that we can do pg_stop_backup with
>> >> wait_for_archive = true on the standby server but it actually doesn't
>> >> wait for WAL archiving. Because this behavior is not documented and we
>> >> cannot find out it without reading source code it will confuse the
>> >> user.
>> >>
>> >> I think we can raise an error when pg_stop_backup with
>> >> wait_for_archive = true is executed on the standby. Attached patch
>> >> change it so that.
>> >
>> >
>> > Wouldn't it be better to make it *work*? If you have
>> > archive_mode=always, it
>> > makes sense to want to wait on the standby as well, does it not?
>> >
>>
>> Yes, ideally it will be better to make it wait for WAL archiving on
>> standby server when archive_mode=always. But I think it would be for
>> PG11 item, and this item is for PG10.
>>
>
> I'm not sure. I think this can be considered a bug in the implementation for
> 10, and as such is "open for fixing". However, it's not a very critical bug
> so I doubt it should be a release blocker, but if someone wants to work on a
> fix I think we should commit it.
>

I agree with you. I'd like to hear opinions from other hackers as well.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standbyserver

From
Peter Eisentraut
Date:
On 6/30/17 04:08, Masahiko Sawada wrote:
>> I'm not sure. I think this can be considered a bug in the implementation for
>> 10, and as such is "open for fixing". However, it's not a very critical bug
>> so I doubt it should be a release blocker, but if someone wants to work on a
>> fix I think we should commit it.
> 
> I agree with you. I'd like to hear opinions from other hackers as well.

It's preferable to make it work.  If it's not easily possible, then we
should prohibit it.

Comments from Stephen (original committer)?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standbyserver

From
Stephen Frost
Date:
Peter, all,

* Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote:
> On 6/30/17 04:08, Masahiko Sawada wrote:
> >> I'm not sure. I think this can be considered a bug in the implementation for
> >> 10, and as such is "open for fixing". However, it's not a very critical bug
> >> so I doubt it should be a release blocker, but if someone wants to work on a
> >> fix I think we should commit it.
> >
> > I agree with you. I'd like to hear opinions from other hackers as well.
>
> It's preferable to make it work.  If it's not easily possible, then we
> should prohibit it.
>
> Comments from Stephen (original committer)?

I agree that it'd be preferable to make it work, but I'm not sure I can
commit to having it done in short order.  I'm happy to work to prohibit
it, but if someone has a few spare cycles to make it actually work,
that'd be great.

In short, I agree with Magnus and feel like I'm more-or-less in the same
boat as he is (though slightly jealous as that's not actually physically
the case, for I hear he has a rather nice boat...).

Thanks!

Stephen

Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Michael Paquier
Date:
On Sat, Jul 1, 2017 at 3:59 AM, Stephen Frost <sfrost@snowman.net> wrote:
> * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote:
>> On 6/30/17 04:08, Masahiko Sawada wrote:
>> >> I'm not sure. I think this can be considered a bug in the implementation for
>> >> 10, and as such is "open for fixing". However, it's not a very critical bug
>> >> so I doubt it should be a release blocker, but if someone wants to work on a
>> >> fix I think we should commit it.
>> >
>> > I agree with you. I'd like to hear opinions from other hackers as well.
>>
>> It's preferable to make it work.  If it's not easily possible, then we
>> should prohibit it.
>>
>> Comments from Stephen (original committer)?
>
> I agree that it'd be preferable to make it work, but I'm not sure I can
> commit to having it done in short order.  I'm happy to work to prohibit
> it, but if someone has a few spare cycles to make it actually work,
> that'd be great.

Fixing the limitation instead of prohibiting it looks like a better
way of doing things to me. It would be hard to explain to users why
the implementation does not consider archive_mode = always. Blocking
it is just four lines of code, still that feels wrong.

> In short, I agree with Magnus and feel like I'm more-or-less in the same
> boat as he is (though slightly jealous as that's not actually physically
> the case, for I hear he has a rather nice boat...).

That means a PG-EU in Sweden at some point?!
-- 
Michael



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Masahiko Sawada
Date:
On Sun, Jul 2, 2017 at 4:39 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Sat, Jul 1, 2017 at 3:59 AM, Stephen Frost <sfrost@snowman.net> wrote:
>> * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote:
>>> On 6/30/17 04:08, Masahiko Sawada wrote:
>>> >> I'm not sure. I think this can be considered a bug in the implementation for
>>> >> 10, and as such is "open for fixing". However, it's not a very critical bug
>>> >> so I doubt it should be a release blocker, but if someone wants to work on a
>>> >> fix I think we should commit it.
>>> >
>>> > I agree with you. I'd like to hear opinions from other hackers as well.
>>>
>>> It's preferable to make it work.  If it's not easily possible, then we
>>> should prohibit it.
>>>
>>> Comments from Stephen (original committer)?
>>
>> I agree that it'd be preferable to make it work, but I'm not sure I can
>> commit to having it done in short order.  I'm happy to work to prohibit
>> it, but if someone has a few spare cycles to make it actually work,
>> that'd be great.
>
> Fixing the limitation instead of prohibiting it looks like a better
> way of doing things to me. It would be hard to explain to users why
> the implementation does not consider archive_mode = always. Blocking
> it is just four lines of code, still that feels wrong.

I feel that since we cannot switch the WAL forcibly on the standby
server we need to find a new way to do so. I'm not sure but it might
be a hard work and be late for PG10. Or you meant that you have a idea
for this?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Michael Paquier
Date:
On Wed, Jul 5, 2017 at 10:19 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I feel that since we cannot switch the WAL forcibly on the standby
> server we need to find a new way to do so. I'm not sure but it might
> be a hard work and be late for PG10. Or you meant that you have a idea
> for this?

Why not refactoring a bit do_pg_stop_backup() so as the wait phase
happens even if a backup is started in recovery? Now wait_for_archive
is ignored because no wait is happening and the stop point is directly
returned back to the caller. For the wait actually happening, I don't
have a better idea than documenting the fact that enforcing manually a
segment switch on the primary needs to happen. That's better than
having users including WAL in their base backups but not actually
having everything they need. And I think that documenting that
properly is better than restricting things that should work.

In most workloads, multiple WAL segments can be generated per second,
and in even more of them a new segment generated would happen in less
than a minute, so waiting for a segment switch on the primary should
not be a problem for most users. The log letting user know about the
wait should be more informative when things happen on a standby, like
"waiting for segment to be finished or switched on the primary".

If the restriction approach is preferred, I think that the check
should happen in do_pg_stop_backup as well, and not in
pg_stop_backup_v2 as your patch suggests. pg_basebackup is not able to
do non-exclusive backups but this may happen some day, who knows..
-- 
Michael



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Michael Paquier
Date:
On Wed, Jul 5, 2017 at 4:57 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Why not refactoring a bit do_pg_stop_backup() so as the wait phase
> happens even if a backup is started in recovery? Now wait_for_archive
> is ignored because no wait is happening and the stop point is directly
> returned back to the caller. For the wait actually happening, I don't
> have a better idea than documenting the fact that enforcing manually a
> segment switch on the primary needs to happen. That's better than
> having users including WAL in their base backups but not actually
> having everything they need. And I think that documenting that
> properly is better than restricting things that should work.

While looking at that in more details, I got surprised by two things:
1) No backup history file is generated on a standby during a base backup.
2) Because of 1), those files are not archived even if archive_mode = always.

This sounds to me like a missing optimization of archive_mode =
always, and the following comment block in xlog.c is at least
incorrect as an archiver can be invoked in recovery:    * XXX currently a backup history file is for informational and
debug   * purposes only. It's not essential for an online backup. Furthermore,    * even if it's created, it will not
bearchived during recovery because    * an archiver is not invoked. So it doesn't seem worthwhile to write a    *
backuphistory file during recovery.
 

So I would suggest the following things to address this issue:
1) Generate a backup history file for backups taken at recovery as well.
2) Archive it if archive_mode = always.
3) Wait for both the segment of the stop point and the backup history
files to be archived before returning back to the caller of
pg_stop_backup.

It would be nice to get all that addressed in 10. Thoughts?
-- 
Michael



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Masahiko Sawada
Date:
On Wed, Jul 5, 2017 at 4:57 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Jul 5, 2017 at 10:19 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> I feel that since we cannot switch the WAL forcibly on the standby
>> server we need to find a new way to do so. I'm not sure but it might
>> be a hard work and be late for PG10. Or you meant that you have a idea
>> for this?
>
> Why not refactoring a bit do_pg_stop_backup() so as the wait phase
> happens even if a backup is started in recovery? Now wait_for_archive
> is ignored because no wait is happening and the stop point is directly
> returned back to the caller. For the wait actually happening, I don't
> have a better idea than documenting the fact that enforcing manually a
> segment switch on the primary needs to happen. That's better than
> having users including WAL in their base backups but not actually
> having everything they need. And I think that documenting that
> properly is better than restricting things that should work.

I agree with this idea. I've tried to make it wait for archiving but
it seems to me that there are other two issues we need to deal with:
the timeline ID on standby server is always reset after created a
restart point, and ThisTimeLineID variable of a backend process is not
set on the standby node when initializing. The former would be easy to
fix because resetting it is for debugging purposes. However, to deal
with latter issue, I'm considering the influence on other things.

> In most workloads, multiple WAL segments can be generated per second,
> and in even more of them a new segment generated would happen in less
> than a minute, so waiting for a segment switch on the primary should
> not be a problem for most users. The log letting user know about the
> wait should be more informative when things happen on a standby, like
> "waiting for segment to be finished or switched on the primary".
>
> If the restriction approach is preferred, I think that the check
> should happen in do_pg_stop_backup as well, and not in
> pg_stop_backup_v2 as your patch suggests. pg_basebackup is not able to
> do non-exclusive backups but this may happen some day, who knows..

Yeah, I understood.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Michael Paquier
Date:
On Fri, Jul 7, 2017 at 4:06 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I agree with this idea. I've tried to make it wait for archiving but
> it seems to me that there are other two issues we need to deal with:
> the timeline ID on standby server is always reset after created a
> restart point, and ThisTimeLineID variable of a backend process is not
> set on the standby node when initializing. The former would be easy to
> fix because resetting it is for debugging purposes. However, to deal
> with latter issue, I'm considering the influence on other things.

ThisTimeLineID does not need to be touched. It seems to me that you
should just use the wait timeline as minRecoveryPointTLI, and wait for
segments using this value. The segment and backup history files to
wait for should be defined with minRecoveryPoint.
-- 
Michael



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Masahiko Sawada
Date:
On Fri, Jul 7, 2017 at 4:21 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Jul 7, 2017 at 4:06 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> I agree with this idea. I've tried to make it wait for archiving but
>> it seems to me that there are other two issues we need to deal with:
>> the timeline ID on standby server is always reset after created a
>> restart point, and ThisTimeLineID variable of a backend process is not
>> set on the standby node when initializing. The former would be easy to
>> fix because resetting it is for debugging purposes. However, to deal
>> with latter issue, I'm considering the influence on other things.
>
> ThisTimeLineID does not need to be touched. It seems to me that you
> should just use the wait timeline as minRecoveryPointTLI, and wait for
> segments using this value. The segment and backup history files to
> wait for should be defined with minRecoveryPoint.

Thank you for the advise. I'll post the patch later.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Masahiko Sawada
Date:
On Fri, Jul 7, 2017 at 3:48 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Jul 5, 2017 at 4:57 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Why not refactoring a bit do_pg_stop_backup() so as the wait phase
>> happens even if a backup is started in recovery? Now wait_for_archive
>> is ignored because no wait is happening and the stop point is directly
>> returned back to the caller. For the wait actually happening, I don't
>> have a better idea than documenting the fact that enforcing manually a
>> segment switch on the primary needs to happen. That's better than
>> having users including WAL in their base backups but not actually
>> having everything they need. And I think that documenting that
>> properly is better than restricting things that should work.
>
> While looking at that in more details, I got surprised by two things:
> 1) No backup history file is generated on a standby during a base backup.
> 2) Because of 1), those files are not archived even if archive_mode = always.
>
> This sounds to me like a missing optimization of archive_mode =
> always, and the following comment block in xlog.c is at least
> incorrect as an archiver can be invoked in recovery:
>      * XXX currently a backup history file is for informational and debug
>      * purposes only. It's not essential for an online backup. Furthermore,
>      * even if it's created, it will not be archived during recovery because
>      * an archiver is not invoked. So it doesn't seem worthwhile to write a
>      * backup history file during recovery.
>
> So I would suggest the following things to address this issue:
> 1) Generate a backup history file for backups taken at recovery as well.
> 2) Archive it if archive_mode = always.
> 3) Wait for both the segment of the stop point and the backup history
> files to be archived before returning back to the caller of
> pg_stop_backup.
>
> It would be nice to get all that addressed in 10. Thoughts?

Yeah, I agree. Attached patch makes it works and deals with the
history file issue.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Noah Misch
Date:
On Fri, Jun 30, 2017 at 02:59:11PM -0400, Stephen Frost wrote:
> Peter, all,
> 
> * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote:
> > On 6/30/17 04:08, Masahiko Sawada wrote:
> > >> I'm not sure. I think this can be considered a bug in the implementation for
> > >> 10, and as such is "open for fixing". However, it's not a very critical bug
> > >> so I doubt it should be a release blocker, but if someone wants to work on a
> > >> fix I think we should commit it.
> > > 
> > > I agree with you. I'd like to hear opinions from other hackers as well.
> > 
> > It's preferable to make it work.  If it's not easily possible, then we
> > should prohibit it.
> > 
> > Comments from Stephen (original committer)?
> 
> I agree that it'd be preferable to make it work, but I'm not sure I can
> commit to having it done in short order.  I'm happy to work to prohibit
> it, but if someone has a few spare cycles to make it actually work,
> that'd be great.

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Stephen,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Stephen Frost
Date:
All,

* Noah Misch (noah@leadboat.com) wrote:
> On Fri, Jun 30, 2017 at 02:59:11PM -0400, Stephen Frost wrote:
> > * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote:
> > > On 6/30/17 04:08, Masahiko Sawada wrote:
> > > >> I'm not sure. I think this can be considered a bug in the implementation for
> > > >> 10, and as such is "open for fixing". However, it's not a very critical bug
> > > >> so I doubt it should be a release blocker, but if someone wants to work on a
> > > >> fix I think we should commit it.
> > > >
> > > > I agree with you. I'd like to hear opinions from other hackers as well.
> > >
> > > It's preferable to make it work.  If it's not easily possible, then we
> > > should prohibit it.
> > >
> > > Comments from Stephen (original committer)?
> >
> > I agree that it'd be preferable to make it work, but I'm not sure I can
> > commit to having it done in short order.  I'm happy to work to prohibit
> > it, but if someone has a few spare cycles to make it actually work,
> > that'd be great.
>
> [Action required within three days.  This is a generic notification.]
>
> The above-described topic is currently a PostgreSQL 10 open item.  Stephen,
> since you committed the patch believed to have created it, you own this open
> item.  If some other commit is more relevant or if this does not belong as a
> v10 open item, please let us know.  Otherwise, please observe the policy on
> open item ownership[1] and send a status update within three calendar days of
> this message.  Include a date for your subsequent status update.  Testers may
> discover new open items at any time, and I want to plan to get them all fixed
> well in advance of shipping v10.  Consequently, I will appreciate your efforts
> toward speedy resolution.  Thanks.

I'm out of town this week but will review the patch from Masahiko and
provide a status update by July 17th.

Thanks!

Stephen

Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Michael Paquier
Date:
On Sun, Jul 9, 2017 at 8:00 PM, Stephen Frost <sfrost@snowman.net> wrote:
> * Noah Misch (noah@leadboat.com) wrote:
>> [Action required within three days.  This is a generic notification.]
>>
>> The above-described topic is currently a PostgreSQL 10 open item.  Stephen,
>> since you committed the patch believed to have created it, you own this open
>> item.  If some other commit is more relevant or if this does not belong as a
>> v10 open item, please let us know.  Otherwise, please observe the policy on
>> open item ownership[1] and send a status update within three calendar days of
>> this message.  Include a date for your subsequent status update.  Testers may
>> discover new open items at any time, and I want to plan to get them all fixed
>> well in advance of shipping v10.  Consequently, I will appreciate your efforts
>> toward speedy resolution.  Thanks.
>
> I'm out of town this week but will review the patch from Masahiko and
> provide a status update by July 17th.

Good to know. I was planning to review the patch within 24 hours. I
have spotted at least one bug in the patch: there is no need to wait
for the backup history file and the last segment to be archived on a
standby unless archive_mode = always, meaning that
XLogArchivingAlways() needs to be used instead of
XLogArchivingActive(). There may be other things, but I am lacking
time and brain power now for a complete analysis.
-- 
Michael



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Michael Paquier
Date:
On Sat, Jul 8, 2017 at 12:50 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Jul 7, 2017 at 3:48 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> So I would suggest the following things to address this issue:
>> 1) Generate a backup history file for backups taken at recovery as well.
>> 2) Archive it if archive_mode = always.
>> 3) Wait for both the segment of the stop point and the backup history
>> files to be archived before returning back to the caller of
>> pg_stop_backup.
>>
>> It would be nice to get all that addressed in 10. Thoughts?
>
> Yeah, I agree. Attached patch makes it works and deals with the
> history file issue.

I had a look at the proposed patch. Here are some comments.

@@ -11002,10 +11000,10 @@ do_pg_stop_backup(char *labelfile, bool
waitforarchive, TimeLineID *stoptli_p)   if (waitforarchive && XLogArchivingActive())   {
XLByteToPrevSeg(stoppoint,_logSegNo);
 
-       XLogFileName(lastxlogfilename, ThisTimeLineID, _logSegNo);
+       XLogFileName(lastxlogfilename, stoptli, _logSegNo);

On a standby the wait phase should not happen if archive_mode = on,
but only if archive_mode = always. So I would suggest to change this
upper condition a bit, and shuffle a bit the code to make the wait
phase happen last:
1) stoptli_p first.
2) Check for XLogArchivingActive or XLogArchivingAlways, then with the
NOTICE message.
3) Do the actual wait.
This way the code doing the wait does not need to be in its long
lengthy if() branch. I think that we should replace the pg_usleep()
call with a latch to make this more responsive. That should be a
future patch.

In backup history files generated in standbys, the STOP TIME is not
set and this results in garbage in the file.

+    If second parameter is true and on standby, <function>pg_stop_backup</>
+    waits for WAL to be archived without forcibly switching WAL on standby.
+    So enforcing manually a WAL switch on primary needs to happen.
Here is a reformulation:
If the second parameter wait_for_archive is true and the backup is
taken on a standby, pg_stop_backup waits for WAL to be archived when
archive_mode = always. Enforcing manually a WAL segment switch to
happen with for example pg_switch_wal() may be necessary if the
primary has low activity to allow the backup to complete. Using
statement_timeout to limit the amount of time to wait or switching
wait_for_archive to false will control the wait time, though all the
WAL segments necessary to recover into a consistent state from the
backup taken may not be archived at the time pg_stop_backup returns
its status to the caller.

The errhint() for the wait phase should be reworked for a standby. I
would suggest for errmsg() the same thing, aka:
pg_stop_backup still waiting for all required WAL segments to be
archived (%d seconds elapsed)
But the following for a backup started in recovery. That's long but
things need to be really clear to the user:
Backup has been taken from a standby, check if the WAL segments needed
for this backup have been completed, in which case a forced segment
switch may can be needed on the primary. If a segment switch has
already happened check that your archive_command is executing
properly. pg_stop_backup can be canceled safely, but the database
backup will not be usable without all the WAL segments.
-- 
Michael



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Masahiko Sawada
Date:
On Mon, Jul 10, 2017 at 11:56 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Sat, Jul 8, 2017 at 12:50 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Fri, Jul 7, 2017 at 3:48 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> So I would suggest the following things to address this issue:
>>> 1) Generate a backup history file for backups taken at recovery as well.
>>> 2) Archive it if archive_mode = always.
>>> 3) Wait for both the segment of the stop point and the backup history
>>> files to be archived before returning back to the caller of
>>> pg_stop_backup.
>>>
>>> It would be nice to get all that addressed in 10. Thoughts?
>>
>> Yeah, I agree. Attached patch makes it works and deals with the
>> history file issue.
>
> I had a look at the proposed patch. Here are some comments.

Thank you for reviewing the patch!

>
> @@ -11002,10 +11000,10 @@ do_pg_stop_backup(char *labelfile, bool
> waitforarchive, TimeLineID *stoptli_p)
>     if (waitforarchive && XLogArchivingActive())
>     {
>         XLByteToPrevSeg(stoppoint, _logSegNo);
> -       XLogFileName(lastxlogfilename, ThisTimeLineID, _logSegNo);
> +       XLogFileName(lastxlogfilename, stoptli, _logSegNo);
>
> On a standby the wait phase should not happen if archive_mode = on,
> but only if archive_mode = always. So I would suggest to change this
> upper condition a bit, and shuffle a bit the code to make the wait
> phase happen last:
> 1) stoptli_p first.
> 2) Check for XLogArchivingActive or XLogArchivingAlways, then with the
> NOTICE message.
> 3) Do the actual wait.

Thank you for the suggestion. Fixed.

> This way the code doing the wait does not need to be in its long
> lengthy if() branch. I think that we should replace the pg_usleep()
> call with a latch to make this more responsive. That should be a
> future patch.

Yeah, I agree.

> In backup history files generated in standbys, the STOP TIME is not
> set and this results in garbage in the file.

Fixed.

> +    If second parameter is true and on standby, <function>pg_stop_backup</>
> +    waits for WAL to be archived without forcibly switching WAL on standby.
> +    So enforcing manually a WAL switch on primary needs to happen.
> Here is a reformulation:
> If the second parameter wait_for_archive is true and the backup is
> taken on a standby, pg_stop_backup waits for WAL to be archived when
> archive_mode = always. Enforcing manually a WAL segment switch to
> happen with for example pg_switch_wal() may be necessary if the
> primary has low activity to allow the backup to complete. Using
> statement_timeout to limit the amount of time to wait or switching
> wait_for_archive to false will control the wait time, though all the
> WAL segments necessary to recover into a consistent state from the
> backup taken may not be archived at the time pg_stop_backup returns
> its status to the caller.

Thanks, fixed.

> The errhint() for the wait phase should be reworked for a standby. I
> would suggest for errmsg() the same thing, aka:
> pg_stop_backup still waiting for all required WAL segments to be
> archived (%d seconds elapsed)
> But the following for a backup started in recovery. That's long but
> things need to be really clear to the user:
> Backup has been taken from a standby, check if the WAL segments needed
> for this backup have been completed, in which case a forced segment
> switch may can be needed on the primary. If a segment switch has
> already happened check that your archive_command is executing
> properly. pg_stop_backup can be canceled safely, but the database
> backup will not be usable without all the WAL segments.

Fixed.

Attached updated version patch. Please review it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Michael Paquier
Date:
On Tue, Jul 11, 2017 at 9:28 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Attached updated version patch. Please review it.

Cool, thanks.

+    useless. If the second parameter <parameter>wait_for_archive</> is true and
+    the backup is taken on a standby, <function>pg_stop_backup</> waits for WAL
+    to archived when <varname>archive_mode</> is <literal>always</>.  Enforcing
+    manually a WAL segment swtich to happen with for example
1) "waits for WAL to BE archived".
2) s/swtich/switch

+    to <literal>false</> will control the wait time, thought all the
WAL segments
s/thought/though/
               /*                * During recovery, since we don't use the end-of-backup WAL                * record
anddon't write the backup history file, the                * starting WAL location doesn't need to be unique.
 
This means                * that two base backups started at the same time
might use                * the same checkpoint as starting locations.                */
This comment in xlog.c needs an update. Two base backups started at
the same can generate a backup history file with the same offset, aka
same file name. I don't think that it matters much for this file
honestly. I think that this would become meaningful once such files
play a more important role, in which case having a unique identifier
would be way more interesting, but this day has IMO not come yet.
Other thoughts are welcome.
   if (waitforarchive && XLogArchivingActive())   {
+       /* Don't wait if WAL archiving not enabled always in recovery */
+       if (backup_started_in_recovery && !XLogArchivingAlways())
+           return stoppoint;
+
This has the smell of breakage waiting to happen, I think that we
should have just one single return point, which updates as well the
stop TLI at the end of the routine. This can just be a single
condition.

+   if (stoptli_p)
+       *stoptli_p = stoptli;
+
Not sure there is any point to move that around, on top of previous comment.

+                            errhint("Backup has been taken from a
standby, check if the WAL segment "
+                                    "needed for this backup have been
completed, in which case a "
+                                    "foreced segment switch may can
be needed on the primary. "
+                                    "If a segment swtich has already
happend check that your "
+                                    "archive_command is executing properly."
+                                    "pg_stop_backup can be canceled
safely, but the database backup "
+                                    "will not be usable without all
the WAL segments.")));
Some comments here:
1) s/foreced/forced/
2) s/may can/may/
3) s/swtich/switch/
4) s/happend/happened/
5) "Backup has been taken from a standby" should be a single sentence.

This is beginning to shape.

Thanks,
-- 
Michael



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Masahiko Sawada
Date:
On Wed, Jul 12, 2017 at 5:06 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Jul 11, 2017 at 9:28 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Attached updated version patch. Please review it.
>
> Cool, thanks.

Thank you for reviewing the patch.

>
> +    useless. If the second parameter <parameter>wait_for_archive</> is true and
> +    the backup is taken on a standby, <function>pg_stop_backup</> waits for WAL
> +    to archived when <varname>archive_mode</> is <literal>always</>.  Enforcing
> +    manually a WAL segment swtich to happen with for example
> 1) "waits for WAL to BE archived".
> 2) s/swtich/switch
>
> +    to <literal>false</> will control the wait time, thought all the
> WAL segments
> s/thought/though/
>
>                 /*
>                  * During recovery, since we don't use the end-of-backup WAL
>                  * record and don't write the backup history file, the
>                  * starting WAL location doesn't need to be unique.
> This means
>                  * that two base backups started at the same time
> might use
>                  * the same checkpoint as starting locations.
>                  */
> This comment in xlog.c needs an update. Two base backups started at
> the same can generate a backup history file with the same offset, aka
> same file name. I don't think that it matters much for this file
> honestly. I think that this would become meaningful once such files
> play a more important role, in which case having a unique identifier
> would be way more interesting, but this day has IMO not come yet.
> Other thoughts are welcome.
>
>     if (waitforarchive && XLogArchivingActive())
>     {
> +       /* Don't wait if WAL archiving not enabled always in recovery */
> +       if (backup_started_in_recovery && !XLogArchivingAlways())
> +           return stoppoint;
> +
> This has the smell of breakage waiting to happen, I think that we
> should have just one single return point, which updates as well the
> stop TLI at the end of the routine. This can just be a single
> condition.
>
> +   if (stoptli_p)
> +       *stoptli_p = stoptli;
> +
> Not sure there is any point to move that around, on top of previous comment.
>
> +                            errhint("Backup has been taken from a
> standby, check if the WAL segment "
> +                                    "needed for this backup have been
> completed, in which case a "
> +                                    "foreced segment switch may can
> be needed on the primary. "
> +                                    "If a segment swtich has already
> happend check that your "
> +                                    "archive_command is executing properly."
> +                                    "pg_stop_backup can be canceled
> safely, but the database backup "
> +                                    "will not be usable without all
> the WAL segments.")));
> Some comments here:
> 1) s/foreced/forced/
> 2) s/may can/may/
> 3) s/swtich/switch/
> 4) s/happend/happened/
> 5) "Backup has been taken from a standby" should be a single sentence.
>
> This is beginning to shape.
>

Sorry, I missed lots of typo in the last patch. All comments from you
are incorporated into the attached latest patch and I've checked it
whether there is other typos. Please review it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Michael Paquier
Date:
On Thu, Jul 13, 2017 at 7:13 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Sorry, I missed lots of typo in the last patch. All comments from you
> are incorporated into the attached latest patch and I've checked it
> whether there is other typos. Please review it.

Thanks for providing a new version of the patch very quickly. I have
spent some time looking at it again and testing it, and this version
looks correct to me. Stephen, as a potential committer, would you look
at it to address this open item?
-- 
Michael



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standbyserver

From
Stephen Frost
Date:
Michael, all,

* Michael Paquier (michael.paquier@gmail.com) wrote:
> On Thu, Jul 13, 2017 at 7:13 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Sorry, I missed lots of typo in the last patch. All comments from you
> > are incorporated into the attached latest patch and I've checked it
> > whether there is other typos. Please review it.
>
> Thanks for providing a new version of the patch very quickly. I have
> spent some time looking at it again and testing it, and this version
> looks correct to me. Stephen, as a potential committer, would you look
> at it to address this open item?

Yes, this weekend.

Thanks!

Stephen

Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Michael Paquier
Date:
On Fri, Jul 14, 2017 at 12:34 AM, Stephen Frost <sfrost@snowman.net> wrote:
> Michael, all,
>
> * Michael Paquier (michael.paquier@gmail.com) wrote:
>> On Thu, Jul 13, 2017 at 7:13 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> > Sorry, I missed lots of typo in the last patch. All comments from you
>> > are incorporated into the attached latest patch and I've checked it
>> > whether there is other typos. Please review it.
>>
>> Thanks for providing a new version of the patch very quickly. I have
>> spent some time looking at it again and testing it, and this version
>> looks correct to me. Stephen, as a potential committer, would you look
>> at it to address this open item?
>
> Yes, this weekend.

Thanks for the confirmation, Stephen!
-- 
Michael



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standbyserver

From
Stephen Frost
Date:
Masahiko, Michael,

* Masahiko Sawada (sawada.mshk@gmail.com) wrote:
> > This is beginning to shape.
>
> Sorry, I missed lots of typo in the last patch. All comments from you
> are incorporated into the attached latest patch and I've checked it
> whether there is other typos. Please review it.

I've taken an initial look through the patch and it looks pretty
reasonable.  I need to go over it in more detail and work through
testing it myself next.

I expect to commit this (with perhaps some minor tweaks primairly around
punctuation/wording), barring any issues found, on Wednesday or Thursday
of this week.

Noah,

I'll provide an update regarding this open item by COB, Thursday July
20th.

Thanks!

Stephen

Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Masahiko Sawada
Date:
On Tue, Jul 18, 2017 at 1:28 PM, Stephen Frost <sfrost@snowman.net> wrote:
> Masahiko, Michael,
>
> * Masahiko Sawada (sawada.mshk@gmail.com) wrote:
>> > This is beginning to shape.
>>
>> Sorry, I missed lots of typo in the last patch. All comments from you
>> are incorporated into the attached latest patch and I've checked it
>> whether there is other typos. Please review it.
>
> I've taken an initial look through the patch and it looks pretty
> reasonable.  I need to go over it in more detail and work through
> testing it myself next.
>
> I expect to commit this (with perhaps some minor tweaks primairly around
> punctuation/wording), barring any issues found, on Wednesday or Thursday
> of this week.

I understood. Thank you for looking at this!

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standbyserver

From
Stephen Frost
Date:
Masahiko, all,

* Masahiko Sawada (sawada.mshk@gmail.com) wrote:
> On Tue, Jul 18, 2017 at 1:28 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > Masahiko, Michael,
> >
> > * Masahiko Sawada (sawada.mshk@gmail.com) wrote:
> >> > This is beginning to shape.
> >>
> >> Sorry, I missed lots of typo in the last patch. All comments from you
> >> are incorporated into the attached latest patch and I've checked it
> >> whether there is other typos. Please review it.
> >
> > I've taken an initial look through the patch and it looks pretty
> > reasonable.  I need to go over it in more detail and work through
> > testing it myself next.
> >
> > I expect to commit this (with perhaps some minor tweaks primairly around
> > punctuation/wording), barring any issues found, on Wednesday or Thursday
> > of this week.
>
> I understood. Thank you for looking at this!

I started discussing this with David off-list and he'd like a chance to
review it in a bit more detail (he's just returning from being gone for
a few weeks).  That review will be posted to this thread on Monday, and
I'll wait until then to move forward with the patch.

Next update will be before Tuesday, July 25th, COB.

Thanks!

Stephen

Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Masahiko Sawada
Date:
On Sat, Jul 22, 2017 at 8:04 AM, Stephen Frost <sfrost@snowman.net> wrote:
> Masahiko, all,
>
> * Masahiko Sawada (sawada.mshk@gmail.com) wrote:
>> On Tue, Jul 18, 2017 at 1:28 PM, Stephen Frost <sfrost@snowman.net> wrote:
>> > Masahiko, Michael,
>> >
>> > * Masahiko Sawada (sawada.mshk@gmail.com) wrote:
>> >> > This is beginning to shape.
>> >>
>> >> Sorry, I missed lots of typo in the last patch. All comments from you
>> >> are incorporated into the attached latest patch and I've checked it
>> >> whether there is other typos. Please review it.
>> >
>> > I've taken an initial look through the patch and it looks pretty
>> > reasonable.  I need to go over it in more detail and work through
>> > testing it myself next.
>> >
>> > I expect to commit this (with perhaps some minor tweaks primairly around
>> > punctuation/wording), barring any issues found, on Wednesday or Thursday
>> > of this week.
>>
>> I understood. Thank you for looking at this!
>
> I started discussing this with David off-list and he'd like a chance to
> review it in a bit more detail (he's just returning from being gone for
> a few weeks).  That review will be posted to this thread on Monday, and
> I'll wait until then to move forward with the patch.
>
> Next update will be before Tuesday, July 25th, COB.
>

Thank you! I'll start to revise the patch as soon as I got review comments.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standbyserver

From
David Steele
Date:
On 7/23/17 11:48 PM, Masahiko Sawada wrote:
> On Sat, Jul 22, 2017 at 8:04 AM, Stephen Frost <sfrost@snowman.net> wrote:
>>
>> I started discussing this with David off-list and he'd like a chance to
>> review it in a bit more detail (he's just returning from being gone for
>> a few weeks).  That review will be posted to this thread on Monday, and
>> I'll wait until then to move forward with the patch.

Before reviewing the patch, I would note that it looks like this issue 
was introduced in b6a323a8c before the 9.6 release.  The documentation 
states that the pg_stop_backup() function will wait for all required WAL 
to be archived, which corresponds to the default in the new patch of 
waitforarchive = true.  The commit notes that the documentation needs 
updating, but since that didn't happen I think we should consider this a 
bug in 9.6 and back patch.

While this patch brings pg_stop_backup() in line with the documentation, 
it also introduces a behavioral change compared to 9.6.  Currently, the 
default 9.6 behavior on a standby is to return immediately from 
pg_stop_backup(), but this patch will cause it to block by default 
instead.  Since action on the master may be required to unblock the 
process, I see this as a pretty significant change.  I still think we 
should fix and backpatch, but I wanted to point this out.

The patch looks sensible to me.  A few comments:

1) I would change:

"Check if the WAL segment needed for this backup have been completed, in 
which case a forced segment switch may be needed on the primary."

To something like:

"If the WAL segments required for this backup have not been archived 
then it might be necessary to force a segment switch on the primary."

2) At backup.sgml L1015 it says that pg_stop_backup() will automatically 
switch the WAL segment.  There should be a caveat here for standby 
backups, like:

When called on a master it terminates the backup mode and performs an 
automatic switch to the next WAL segment.  The reason for the switch is 
to arrange for the last WAL segment written during the backup interval 
to be ready to archive.  When called on a standby the WAL segment switch 
must be performed manually on the master if it does not happen due to 
normal write activity.

3) The fact that this fix requires "archive_mode = always" seems like a 
pretty big caveat, thought I don't have any ideas on how to make it 
better without big code changes.  Maybe it would be help to change:

+ the backup is taken on a standby, <function>pg_stop_backup</> waits
+ for WAL to be archived when <varname>archive_mode</>

To:

+ the backup is taken on a standby, <function>pg_stop_backup</> waits
+ for WAL to be archived *only* when <varname>archive_mode</>

Perhaps this should be noted in the pg_basebackup docs as well?

Regards,
-- 
-David
david@pgmasters.net



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Robert Haas
Date:
On Mon, Jul 24, 2017 at 11:40 AM, David Steele <david@pgmasters.net> wrote:
> Before reviewing the patch, I would note that it looks like this issue was
> introduced in b6a323a8c before the 9.6 release.  The documentation states
> that the pg_stop_backup() function will wait for all required WAL to be
> archived, which corresponds to the default in the new patch of
> waitforarchive = true.  The commit notes that the documentation needs
> updating, but since that didn't happen I think we should consider this a bug
> in 9.6 and back patch.
>
> While this patch brings pg_stop_backup() in line with the documentation, it
> also introduces a behavioral change compared to 9.6.  Currently, the default
> 9.6 behavior on a standby is to return immediately from pg_stop_backup(),
> but this patch will cause it to block by default instead.  Since action on
> the master may be required to unblock the process, I see this as a pretty
> significant change.  I still think we should fix and backpatch, but I wanted
> to point this out.

Hmm.  That seems to me like it could break backup scripts that are
currently working, which seems like a *very* unfriendly thing to do in
a minor release, especially since 9.6 has no option to override the
default behavior (nor can we add one, since it would require a change
to catalog contents).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standbyserver

From
Stephen Frost
Date:
David,

* David Steele (david@pgmasters.net) wrote:
> On 7/23/17 11:48 PM, Masahiko Sawada wrote:
> >On Sat, Jul 22, 2017 at 8:04 AM, Stephen Frost <sfrost@snowman.net> wrote:
> >>
> >>I started discussing this with David off-list and he'd like a chance to
> >>review it in a bit more detail (he's just returning from being gone for
> >>a few weeks).  That review will be posted to this thread on Monday, and
> >>I'll wait until then to move forward with the patch.
>
> Before reviewing the patch, I would note that it looks like this
> issue was introduced in b6a323a8c before the 9.6 release.  The
> documentation states that the pg_stop_backup() function will wait
> for all required WAL to be archived, which corresponds to the
> default in the new patch of waitforarchive = true.  The commit notes
> that the documentation needs updating, but since that didn't happen
> I think we should consider this a bug in 9.6 and back patch.

I tend to agree with this.  The documentation clearly says that
pg_stop_backup() waits for all WAL to be archived, but, today, if it's
run on a standby then it doesn't wait, which might lead to invalid
backups if the backup software is depending on that.

> While this patch brings pg_stop_backup() in line with the
> documentation, it also introduces a behavioral change compared to
> 9.6.  Currently, the default 9.6 behavior on a standby is to return
> immediately from pg_stop_backup(), but this patch will cause it to
> block by default instead.  Since action on the master may be
> required to unblock the process, I see this as a pretty significant
> change.  I still think we should fix and backpatch, but I wanted to
> point this out.

This will need to be highlighted in the release notes for 9.6.4 also,
assuming there is agreement to back-patch this, and we'll need to update
the documentation in 9.6 also to be clearer about what happens on a
standby.

> The patch looks sensible to me.  A few comments:
>
> 1) I would change:
>
> "Check if the WAL segment needed for this backup have been
> completed, in which case a forced segment switch may be needed on
> the primary."
>
> To something like:
>
> "If the WAL segments required for this backup have not been archived
> then it might be necessary to force a segment switch on the
> primary."

I'm a bit on the fence regarding the distinction here for the
backup-from-standby and this errhint().  The issue is that the WAL for
the backup hasn't been archived yet and that may be because of either:

archive_command is failing
OR
No WAL is getting generated

In either case, both of these apply to the primary and the standby.  As
such, I'm inclined to try to mention both in the original errhint()
instead of making two different ones.  I'm open to suggestions on this,
of course, but my thinking would be more like:

-----
Either archive_command is failing or not enough WAL has been generated
to require a segment switch.  Run pg_switch_wal() to request a WAL
switch and monitor your logs to check that your archive_command is
executing properly.
-----

And then change pg_switch_wal()'s errhint for when it's run on a replica
to be:

----
HINT: WAL control functions cannont be executed during recovery; they
should be executed on the primary instead.
----

(or similar, again, open to suggestions here).

> 2) At backup.sgml L1015 it says that pg_stop_backup() will
> automatically switch the WAL segment.  There should be a caveat here
> for standby backups, like:
>
> When called on a master it terminates the backup mode and performs
> an automatic switch to the next WAL segment.  The reason for the
> switch is to arrange for the last WAL segment written during the
> backup interval to be ready to archive.  When called on a standby
> the WAL segment switch must be performed manually on the master if
> it does not happen due to normal write activity.

s/master/primary/g

Otherwise it looks alright..  Might try to reword to use similar
language as to what we have today in 25.3.3.1.

> 3) The fact that this fix requires "archive_mode = always" seems
> like a pretty big caveat, thought I don't have any ideas on how to
> make it better without big code changes.  Maybe it would be help to
> change:
>
> + the backup is taken on a standby, <function>pg_stop_backup</> waits
> + for WAL to be archived when <varname>archive_mode</>
>
> To:
>
> + the backup is taken on a standby, <function>pg_stop_backup</> waits
> + for WAL to be archived *only* when <varname>archive_mode</>

I'm thinking of rewording that a bit to say "When archive_mode = always"
instead, as I think that might be a little clearer.

> Perhaps this should be noted in the pg_basebackup docs as well?

That seems like it's probably a good idea too, as people might wonder
why pg_basebackup hasn't finished yet if it's waiting for WAL to be
archived.

Thanks!

Stephen

Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standbyserver

From
Stephen Frost
Date:
Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Jul 24, 2017 at 11:40 AM, David Steele <david@pgmasters.net> wrote:
> > Before reviewing the patch, I would note that it looks like this issue was
> > introduced in b6a323a8c before the 9.6 release.  The documentation states
> > that the pg_stop_backup() function will wait for all required WAL to be
> > archived, which corresponds to the default in the new patch of
> > waitforarchive = true.  The commit notes that the documentation needs
> > updating, but since that didn't happen I think we should consider this a bug
> > in 9.6 and back patch.
> >
> > While this patch brings pg_stop_backup() in line with the documentation, it
> > also introduces a behavioral change compared to 9.6.  Currently, the default
> > 9.6 behavior on a standby is to return immediately from pg_stop_backup(),
> > but this patch will cause it to block by default instead.  Since action on
> > the master may be required to unblock the process, I see this as a pretty
> > significant change.  I still think we should fix and backpatch, but I wanted
> > to point this out.
>
> Hmm.  That seems to me like it could break backup scripts that are
> currently working, which seems like a *very* unfriendly thing to do in
> a minor release, especially since 9.6 has no option to override the
> default behavior (nor can we add one, since it would require a change
> to catalog contents).

Those backup scripts might very well be, today, producing invalid
backups though, which would be much less good..

I'd hate to have to do it, but we could technically add a GUC to address
this in the back-branches, no?  I'm not sure that's really worthwhile
though..

Thanks!

Stephen

Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Robert Haas
Date:
On Mon, Jul 24, 2017 at 12:31 PM, Stephen Frost <sfrost@snowman.net> wrote:
> Those backup scripts might very well be, today, producing invalid
> backups though, which would be much less good..

True.  However, I suspect that depends on what procedure is actually
being followed.  If *everyone* who is using this is getting corrupt
backups, then of course changing the behavior is a no-brainer.  But if
*some* people are getting correct backups and *some* people are
getting incorrect backups, depending on procedure, then I think
changing it is unwise.  We should optimize for the case of a user who
is currently doing something smart, not one who is doing something
dumb.

> I'd hate to have to do it, but we could technically add a GUC to address
> this in the back-branches, no?  I'm not sure that's really worthwhile
> though..

That would be mighty ugly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standbyserver

From
Stephen Frost
Date:
Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Jul 24, 2017 at 12:31 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > Those backup scripts might very well be, today, producing invalid
> > backups though, which would be much less good..
>
> True.  However, I suspect that depends on what procedure is actually
> being followed.  If *everyone* who is using this is getting corrupt
> backups, then of course changing the behavior is a no-brainer.  But if
> *some* people are getting correct backups and *some* people are
> getting incorrect backups, depending on procedure, then I think
> changing it is unwise.  We should optimize for the case of a user who
> is currently doing something smart, not one who is doing something
> dumb.

I'm not sure that we can call "writing code that depends on what the
docs say" dumb, unfortunately, and if even one person is getting an
invalid backup while following what the docs say then that's a strong
case to do *something*.  Of course, we also don't want to break the
scripts of people who are doing things correctly, but I'm trying to
figure out exactly how this change would break such scripts.

What the change would do is make the pg_stop_backup() caller block until
the last WAL is archvied, and perhaps that ends up taking hours, and
then the connection is dropped for whatever reason and the backup fails
where it otherwise.... what?  wouldn't have been valid anyway at that
point, since it's not valid until the last WAL is actually archived.
Perhaps eventually it would be archived and the caller was planning for
that and everything is fine, but, well, that feels like an awful lot of
wishful thinking.

And that's assuming that the script author made sure to write additional
code that didn't mark the backup as valid until the ending WAL segment
from pg_stop_backup() was confirmed to have been archived.

Last, but perhaps not least, this is at least just an issue going back
to where pg_start/stop_backup was allowed on replicas, which is only 9.6
and therefore it's been out less than a year and anyone's script which
might break due to this would at least be relatively new code.

> > I'd hate to have to do it, but we could technically add a GUC to address
> > this in the back-branches, no?  I'm not sure that's really worthwhile
> > though..
>
> That would be mighty ugly.

Oh, absolutely agreed.

Thanks!

Stephen

Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standbyserver

From
David Steele
Date:
On 7/24/17 12:28 PM, Stephen Frost wrote:
> 
> * David Steele (david@pgmasters.net) wrote:
> 
>> While this patch brings pg_stop_backup() in line with the
>> documentation, it also introduces a behavioral change compared to
>> 9.6.  Currently, the default 9.6 behavior on a standby is to return
>> immediately from pg_stop_backup(), but this patch will cause it to
>> block by default instead.  Since action on the master may be
>> required to unblock the process, I see this as a pretty significant
>> change.  I still think we should fix and backpatch, but I wanted to
>> point this out.
> 
> This will need to be highlighted in the release notes for 9.6.4 also,
> assuming there is agreement to back-patch this, and we'll need to update
> the documentation in 9.6 also to be clearer about what happens on a
> standby.

Agreed.

>> "If the WAL segments required for this backup have not been archived
>> then it might be necessary to force a segment switch on the
>> primary."
> 
> I'm a bit on the fence regarding the distinction here for the
> backup-from-standby and this errhint().  The issue is that the WAL for
> the backup hasn't been archived yet and that may be because of either:
> 
> archive_command is failing
> OR
> No WAL is getting generated
> 
> In either case, both of these apply to the primary and the standby.  As
> such, I'm inclined to try to mention both in the original errhint()
> instead of making two different ones.  I'm open to suggestions on this,
> of course, but my thinking would be more like:
> 
> -----
> Either archive_command is failing or not enough WAL has been generated
> to require a segment switch.  Run pg_switch_wal() to request a WAL
> switch and monitor your logs to check that your archive_command is
> executing properly.
> -----

Yes, that seems more concise.  I like the idea of not having to maintain 
two separate hints.

> And then change pg_switch_wal()'s errhint for when it's run on a replica
> to be:
> 
> ----
> HINT: WAL control functions cannont be executed during recovery; they
> should be executed on the primary instead.
> ----

Looks good to me.  This explanation is useful in general.

>> 2) At backup.sgml L1015 it says that pg_stop_backup() will
>> automatically switch the WAL segment.  There should be a caveat here
>> for standby backups, like:
>>
>> When called on a master it terminates the backup mode and performs
>> an automatic switch to the next WAL segment.  The reason for the
>> switch is to arrange for the last WAL segment written during the
>> backup interval to be ready to archive.  When called on a standby
>> the WAL segment switch must be performed manually on the master if
>> it does not happen due to normal write activity.
> 
> s/master/primary/g

Yes.

>> 3) The fact that this fix requires "archive_mode = always" seems
>> like a pretty big caveat, thought I don't have any ideas on how to
>> make it better without big code changes.  Maybe it would be help to
>> change:
>>
>> + the backup is taken on a standby, <function>pg_stop_backup</> waits
>> + for WAL to be archived when <varname>archive_mode</>
>>
>> To:
>>
>> + the backup is taken on a standby, <function>pg_stop_backup</> waits
>> + for WAL to be archived *only* when <varname>archive_mode</>
> 
> I'm thinking of rewording that a bit to say "When archive_mode = always"
> instead, as I think that might be a little clearer.

Sure.

>> Perhaps this should be noted in the pg_basebackup docs as well?
> 
> That seems like it's probably a good idea too, as people might wonder
> why pg_basebackup hasn't finished yet if it's waiting for WAL to be
> archived.

Yes, and this is another behavioral change to consider -- one that is 
more likely to impact users than the change to pg_stop_backup().  If 
pg_basebackup is not currently waiting for WAL on a standby (but does on 
a primary) then that's pretty serious.

Any thoughts on this, Magnus?

-- 
-David
david@pgmasters.net



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Michael Paquier
Date:
On Mon, Jul 24, 2017 at 6:45 PM, Stephen Frost <sfrost@snowman.net> wrote:
> What the change would do is make the pg_stop_backup() caller block until
> the last WAL is archvied, and perhaps that ends up taking hours, and
> then the connection is dropped for whatever reason and the backup fails
> where it otherwise.... what?  wouldn't have been valid anyway at that
> point, since it's not valid until the last WAL is actually archived.
> Perhaps eventually it would be archived and the caller was planning for
> that and everything is fine, but, well, that feels like an awful lot of
> wishful thinking.

Letting users taking unconsciously inconsistent backups is worse than
potentially breaking scripts that were actually not working as
Postgres would expect. So I am +1 for back-patching a lighter version
of the proposed patch that makes the wait happen on purpose.

>> > I'd hate to have to do it, but we could technically add a GUC to address
>> > this in the back-branches, no?  I'm not sure that's really worthwhile
>> > though..
>>
>> That would be mighty ugly.
>
> Oh, absolutely agreed.

Yes, let's avoid that. We are talking about a switch aimed at making
backups potentially inconsistent.
-- 
Michael



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standbyserver

From
David Steele
Date:
On 7/24/17 3:28 PM, David Steele wrote:
> 
> Yes, and this is another behavioral change to consider -- one that is 
> more likely to impact users than the change to pg_stop_backup().  If 
> pg_basebackup is not currently waiting for WAL on a standby (but does on 
> a primary) then that's pretty serious.

My bad, before PG10 pg_basebackup did not check that WAL was archived on 
a primary *or* a standby unless the -x option was specified, as documented.

-- 
-David
david@pgmasters.net



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Masahiko Sawada
Date:
On Tue, Jul 25, 2017 at 4:43 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Mon, Jul 24, 2017 at 6:45 PM, Stephen Frost <sfrost@snowman.net> wrote:
>> What the change would do is make the pg_stop_backup() caller block until
>> the last WAL is archvied, and perhaps that ends up taking hours, and
>> then the connection is dropped for whatever reason and the backup fails
>> where it otherwise.... what?  wouldn't have been valid anyway at that
>> point, since it's not valid until the last WAL is actually archived.
>> Perhaps eventually it would be archived and the caller was planning for
>> that and everything is fine, but, well, that feels like an awful lot of
>> wishful thinking.
>
> Letting users taking unconsciously inconsistent backups is worse than
> potentially breaking scripts that were actually not working as
> Postgres would expect. So I am +1 for back-patching a lighter version
> of the proposed patch that makes the wait happen on purpose.
>
>>> > I'd hate to have to do it, but we could technically add a GUC to address
>>> > this in the back-branches, no?  I'm not sure that's really worthwhile
>>> > though..
>>>
>>> That would be mighty ugly.
>>
>> Oh, absolutely agreed.
>
> Yes, let's avoid that. We are talking about a switch aimed at making
> backups potentially inconsistent.

Thank you for the review comments!
Attached updated the patch. The noting in pg_baseback doc will be
necessary for back branches if we decided to not back-patch it to back
branches. So it's not contained in this patch for now.

Regarding back-patching this to back branches, I also vote for
back-patching to back branches. Or we can fix the docs of back
branches and fix the code only in PG10. I expect that the user who
wrote a backup script has done enough functional test and dealt with
this issue somehow, but since the current doc clearly says that
pg_stop_backup() waits for all WAL to be archived we have to make a
consideration about there are users who wrote a wrong backup script.
So I think we should at least notify it in the minor release.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standbyserver

From
Stephen Frost
Date:
All,

* Masahiko Sawada (sawada.mshk@gmail.com) wrote:
> On Tue, Jul 25, 2017 at 4:43 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > On Mon, Jul 24, 2017 at 6:45 PM, Stephen Frost <sfrost@snowman.net> wrote:
> >> What the change would do is make the pg_stop_backup() caller block until
> >> the last WAL is archvied, and perhaps that ends up taking hours, and
> >> then the connection is dropped for whatever reason and the backup fails
> >> where it otherwise.... what?  wouldn't have been valid anyway at that
> >> point, since it's not valid until the last WAL is actually archived.
> >> Perhaps eventually it would be archived and the caller was planning for
> >> that and everything is fine, but, well, that feels like an awful lot of
> >> wishful thinking.
> >
> > Letting users taking unconsciously inconsistent backups is worse than
> > potentially breaking scripts that were actually not working as
> > Postgres would expect. So I am +1 for back-patching a lighter version
> > of the proposed patch that makes the wait happen on purpose.
> >
> >>> > I'd hate to have to do it, but we could technically add a GUC to address
> >>> > this in the back-branches, no?  I'm not sure that's really worthwhile
> >>> > though..
> >>>
> >>> That would be mighty ugly.
> >>
> >> Oh, absolutely agreed.
> >
> > Yes, let's avoid that. We are talking about a switch aimed at making
> > backups potentially inconsistent.
>
> Thank you for the review comments!
> Attached updated the patch. The noting in pg_baseback doc will be
> necessary for back branches if we decided to not back-patch it to back
> branches. So it's not contained in this patch for now.
>
> Regarding back-patching this to back branches, I also vote for
> back-patching to back branches. Or we can fix the docs of back
> branches and fix the code only in PG10. I expect that the user who
> wrote a backup script has done enough functional test and dealt with
> this issue somehow, but since the current doc clearly says that
> pg_stop_backup() waits for all WAL to be archived we have to make a
> consideration about there are users who wrote a wrong backup script.
> So I think we should at least notify it in the minor release.

That sounds like we have at least three folks thinking that this should
be back-patched, and one who doesn't.  Does anyone else wish to voice an
opinion regarding back-patching this..?

Thanks!

Stephen

Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Noah Misch
Date:
On Fri, Jul 21, 2017 at 07:04:32PM -0400, Stephen Frost wrote:
> Masahiko, all,
> 
> * Masahiko Sawada (sawada.mshk@gmail.com) wrote:
> > On Tue, Jul 18, 2017 at 1:28 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > > Masahiko, Michael,
> > >
> > > * Masahiko Sawada (sawada.mshk@gmail.com) wrote:
> > >> > This is beginning to shape.
> > >>
> > >> Sorry, I missed lots of typo in the last patch. All comments from you
> > >> are incorporated into the attached latest patch and I've checked it
> > >> whether there is other typos. Please review it.
> > >
> > > I've taken an initial look through the patch and it looks pretty
> > > reasonable.  I need to go over it in more detail and work through
> > > testing it myself next.
> > >
> > > I expect to commit this (with perhaps some minor tweaks primairly around
> > > punctuation/wording), barring any issues found, on Wednesday or Thursday
> > > of this week.
> > 
> > I understood. Thank you for looking at this!
> 
> I started discussing this with David off-list and he'd like a chance to
> review it in a bit more detail (he's just returning from being gone for
> a few weeks).  That review will be posted to this thread on Monday, and
> I'll wait until then to move forward with the patch.
> 
> Next update will be before Tuesday, July 25th, COB.

This PostgreSQL 10 open item is past due for your status update.  Kindly send
a status update within 24 hours, and include a date for your subsequent status
update.  Refer to the policy on open item ownership:
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Stephen Frost
Date:
Noah, all,

* Noah Misch (noah@leadboat.com) wrote:
> This PostgreSQL 10 open item is past due for your status update.  Kindly send
> a status update within 24 hours, and include a date for your subsequent status
> update.  Refer to the policy on open item ownership:

Based on the ongoing discussion, this is really looking like it's
actually a fix that needs to be back-patched to 9.6 rather than a PG10
open item.  I don't have any issue with keeping it as an open item
though, just mentioning it.  I'll provide another status update on or
before Monday, July 31st.

I'll get to work on the back-patch and try to draft up something to go
into the release notes for 9.6.4.

Thanks!

Stephen

Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Robert Haas
Date:
On Thu, Jul 27, 2017 at 10:27 AM, Stephen Frost <sfrost@snowman.net> wrote:
> * Noah Misch (noah@leadboat.com) wrote:
>> This PostgreSQL 10 open item is past due for your status update.  Kindly send
>> a status update within 24 hours, and include a date for your subsequent status
>> update.  Refer to the policy on open item ownership:
>
> Based on the ongoing discussion, this is really looking like it's
> actually a fix that needs to be back-patched to 9.6 rather than a PG10
> open item.  I don't have any issue with keeping it as an open item
> though, just mentioning it.  I'll provide another status update on or
> before Monday, July 31st.
>
> I'll get to work on the back-patch and try to draft up something to go
> into the release notes for 9.6.4.

Whether this is going to be back-patched or not, you should do
something about it quickly, because we're wrapping a new beta and a
full set of back-branch releases next week.  I'm personally hoping
that what follows beta3 will be rc1, but if we have too much churn
after beta3 we'll end up with a beta4, which could end up slipping the
whole release cycle.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standbyserver

From
Stephen Frost
Date:
* Robert Haas (robertmhaas@gmail.com) wrote:
> On Thu, Jul 27, 2017 at 10:27 AM, Stephen Frost <sfrost@snowman.net> wrote:
> > * Noah Misch (noah@leadboat.com) wrote:
> >> This PostgreSQL 10 open item is past due for your status update.  Kindly send
> >> a status update within 24 hours, and include a date for your subsequent status
> >> update.  Refer to the policy on open item ownership:
> >
> > Based on the ongoing discussion, this is really looking like it's
> > actually a fix that needs to be back-patched to 9.6 rather than a PG10
> > open item.  I don't have any issue with keeping it as an open item
> > though, just mentioning it.  I'll provide another status update on or
> > before Monday, July 31st.
> >
> > I'll get to work on the back-patch and try to draft up something to go
> > into the release notes for 9.6.4.
>
> Whether this is going to be back-patched or not, you should do
> something about it quickly, because we're wrapping a new beta and a
> full set of back-branch releases next week.  I'm personally hoping
> that what follows beta3 will be rc1, but if we have too much churn
> after beta3 we'll end up with a beta4, which could end up slipping the
> whole release cycle.

Yes, I've been working on this and the other issues with pg_dump today.

Thanks!

Stephen

Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Noah Misch
Date:
On Thu, Jul 27, 2017 at 10:27:36AM -0400, Stephen Frost wrote:
> Noah, all,
> 
> * Noah Misch (noah@leadboat.com) wrote:
> > This PostgreSQL 10 open item is past due for your status update.  Kindly send
> > a status update within 24 hours, and include a date for your subsequent status
> > update.  Refer to the policy on open item ownership:
> 
> Based on the ongoing discussion, this is really looking like it's
> actually a fix that needs to be back-patched to 9.6 rather than a PG10
> open item.  I don't have any issue with keeping it as an open item
> though, just mentioning it.  I'll provide another status update on or
> before Monday, July 31st.

This PostgreSQL 10 open item is past due for your status update.  Kindly send
a status update within 24 hours, and include a date for your subsequent status
update.  Refer to the policy on open item ownership:
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Michael Paquier
Date:
On Mon, Jul 31, 2017 at 9:13 PM, Stephen Frost <sfrost@snowman.net> wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
>> On Thu, Jul 27, 2017 at 10:27 AM, Stephen Frost <sfrost@snowman.net> wrote:
>> > * Noah Misch (noah@leadboat.com) wrote:
>> >> This PostgreSQL 10 open item is past due for your status update.  Kindly send
>> >> a status update within 24 hours, and include a date for your subsequent status
>> >> update.  Refer to the policy on open item ownership:
>> >
>> > Based on the ongoing discussion, this is really looking like it's
>> > actually a fix that needs to be back-patched to 9.6 rather than a PG10
>> > open item.  I don't have any issue with keeping it as an open item
>> > though, just mentioning it.  I'll provide another status update on or
>> > before Monday, July 31st.
>> >
>> > I'll get to work on the back-patch and try to draft up something to go
>> > into the release notes for 9.6.4.
>>
>> Whether this is going to be back-patched or not, you should do
>> something about it quickly, because we're wrapping a new beta and a
>> full set of back-branch releases next week.  I'm personally hoping
>> that what follows beta3 will be rc1, but if we have too much churn
>> after beta3 we'll end up with a beta4, which could end up slipping the
>> whole release cycle.
>
> Yes, I've been working on this and the other issues with pg_dump today.

Do you need a back-patchable version for 9.6? I could get one out of
my pocket if necessary.
-- 
Michael



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standbyserver

From
Stephen Frost
Date:
* Michael Paquier (michael.paquier@gmail.com) wrote:
> On Mon, Jul 31, 2017 at 9:13 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > * Robert Haas (robertmhaas@gmail.com) wrote:
> >> On Thu, Jul 27, 2017 at 10:27 AM, Stephen Frost <sfrost@snowman.net> wrote:
> >> > * Noah Misch (noah@leadboat.com) wrote:
> >> >> This PostgreSQL 10 open item is past due for your status update.  Kindly send
> >> >> a status update within 24 hours, and include a date for your subsequent status
> >> >> update.  Refer to the policy on open item ownership:
> >> >
> >> > Based on the ongoing discussion, this is really looking like it's
> >> > actually a fix that needs to be back-patched to 9.6 rather than a PG10
> >> > open item.  I don't have any issue with keeping it as an open item
> >> > though, just mentioning it.  I'll provide another status update on or
> >> > before Monday, July 31st.
> >> >
> >> > I'll get to work on the back-patch and try to draft up something to go
> >> > into the release notes for 9.6.4.
> >>
> >> Whether this is going to be back-patched or not, you should do
> >> something about it quickly, because we're wrapping a new beta and a
> >> full set of back-branch releases next week.  I'm personally hoping
> >> that what follows beta3 will be rc1, but if we have too much churn
> >> after beta3 we'll end up with a beta4, which could end up slipping the
> >> whole release cycle.
> >
> > Yes, I've been working on this and the other issues with pg_dump today.
>
> Do you need a back-patchable version for 9.6? I could get one out of
> my pocket if necessary.

I was just trying to find a bit of time to generate exactly that- if
you have a couple spare cycles, it would certainly help.

Thanks!

Stephen

Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Stephen Frost
Date:
Noah,

On Tue, Aug 1, 2017 at 20:52 Noah Misch <noah@leadboat.com> wrote:
On Thu, Jul 27, 2017 at 10:27:36AM -0400, Stephen Frost wrote:
> Noah, all,
>
> * Noah Misch (noah@leadboat.com) wrote:
> > This PostgreSQL 10 open item is past due for your status update.  Kindly send
> > a status update within 24 hours, and include a date for your subsequent status
> > update.  Refer to the policy on open item ownership:
>
> Based on the ongoing discussion, this is really looking like it's
> actually a fix that needs to be back-patched to 9.6 rather than a PG10
> open item.  I don't have any issue with keeping it as an open item
> though, just mentioning it.  I'll provide another status update on or
> before Monday, July 31st.

This PostgreSQL 10 open item is past due for your status update.  Kindly send
a status update within 24 hours, and include a date for your subsequent status
update.  

I'll provide another update tomorrow.  Hopefully Michael is able to produce a 9.6 patch, otherwise I'll do it. 

Thanks!

Stephen

Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Michael Paquier
Date:
On Wed, Aug 2, 2017 at 7:58 PM, Stephen Frost <sfrost@snowman.net> wrote:
> * Michael Paquier (michael.paquier@gmail.com) wrote:
>> Do you need a back-patchable version for 9.6? I could get one out of
>> my pocket if necessary.
>
> I was just trying to find a bit of time to generate exactly that- if
> you have a couple spare cycles, it would certainly help.

OK, here you go. Even if archive_mode = always has been introduced in
9.5, but non-exclusive mode is a 9.6 feature, so here are patches down
to this version. I am pretty satisfied by this, and I included all the
comments and error message corrections reviewed up to now. I noticed
some incorrect comments, doc typos and an incorrect indentation as
well for the WARNING reported to the user when waiting for the
archiving.
-- 
Michael

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Michael Paquier
Date:
On Thu, Aug 3, 2017 at 4:29 AM, Stephen Frost <sfrost@snowman.net> wrote:
> I'll provide another update tomorrow.  Hopefully Michael is able to produce
> a 9.6 patch, otherwise I'll do it.

I have sent an updated version of the patch, with something that can
be used for 9.6 as well. It would be nice to get something into the
next set of minor releases.
-- 
Michael



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standbyserver

From
Stephen Frost
Date:
Michael,

* Michael Paquier (michael.paquier@gmail.com) wrote:
> On Wed, Aug 2, 2017 at 7:58 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > * Michael Paquier (michael.paquier@gmail.com) wrote:
> >> Do you need a back-patchable version for 9.6? I could get one out of
> >> my pocket if necessary.
> >
> > I was just trying to find a bit of time to generate exactly that- if
> > you have a couple spare cycles, it would certainly help.
>
> OK, here you go. Even if archive_mode = always has been introduced in
> 9.5, but non-exclusive mode is a 9.6 feature, so here are patches down
> to this version. I am pretty satisfied by this, and I included all the
> comments and error message corrections reviewed up to now. I noticed
> some incorrect comments, doc typos and an incorrect indentation as
> well for the WARNING reported to the user when waiting for the
> archiving.

Thanks much for this.  I've looked over it some and it looks pretty good
to me.  I'm going to play with it a bit more and, barring issues, will
push them tomorrow morning.

Thanks!

Stephen

Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Stephen Frost
Date:
* Michael Paquier (michael.paquier@gmail.com) wrote:
> On Thu, Aug 3, 2017 at 4:29 AM, Stephen Frost <sfrost@snowman.net> wrote:
> > I'll provide another update tomorrow.  Hopefully Michael is able to produce
> > a 9.6 patch, otherwise I'll do it.
>
> I have sent an updated version of the patch, with something that can
> be used for 9.6 as well. It would be nice to get something into the
> next set of minor releases.

Thanks for the patches.  I'm planning to push them tomorrow morning
after a bit more review and testing.  I'll provide an update tomorrow.

Thanks!

Stephen

Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Masahiko Sawada
Date:
On Fri, Aug 4, 2017 at 10:48 AM, Stephen Frost <sfrost@snowman.net> wrote:
> Michael,
>
> * Michael Paquier (michael.paquier@gmail.com) wrote:
>> On Wed, Aug 2, 2017 at 7:58 PM, Stephen Frost <sfrost@snowman.net> wrote:
>> > * Michael Paquier (michael.paquier@gmail.com) wrote:
>> >> Do you need a back-patchable version for 9.6? I could get one out of
>> >> my pocket if necessary.
>> >
>> > I was just trying to find a bit of time to generate exactly that- if
>> > you have a couple spare cycles, it would certainly help.
>>
>> OK, here you go. Even if archive_mode = always has been introduced in
>> 9.5, but non-exclusive mode is a 9.6 feature, so here are patches down
>> to this version. I am pretty satisfied by this, and I included all the
>> comments and error message corrections reviewed up to now. I noticed
>> some incorrect comments, doc typos and an incorrect indentation as
>> well for the WARNING reported to the user when waiting for the
>> archiving.
>
> Thanks much for this.  I've looked over it some and it looks pretty good
> to me.  I'm going to play with it a bit more and, barring issues, will
> push them tomorrow morning.
>

Thank you for the patches, Michael-san! It looks good to me too.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Robert Haas
Date:
On Thu, Aug 3, 2017 at 9:49 PM, Stephen Frost <sfrost@snowman.net> wrote:
> Thanks for the patches.  I'm planning to push them tomorrow morning
> after a bit more review and testing.  I'll provide an update tomorrow.

Obviously, the part about pushing them Friday morning didn't happen,
and you're running out of time for providing an update on Friday, too.
The release is due to wrap in less than 72 hours.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Stephen Frost
Date:
Robert,

On Fri, Aug 4, 2017 at 23:17 Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Aug 3, 2017 at 9:49 PM, Stephen Frost <sfrost@snowman.net> wrote:
> Thanks for the patches.  I'm planning to push them tomorrow morning
> after a bit more review and testing.  I'll provide an update tomorrow.

Obviously, the part about pushing them Friday morning didn't happen,
and you're running out of time for providing an update on Friday, too.
The release is due to wrap in less than 72 hours.

Unfortunately the day got away from me due to some personal... adventures (having to do with lack of air conditioning first and then lack of gas, amongst a lot of other things going on right now...). I just got things back online but, well, my day tomorrow is pretty well packed solid.  I wouldn't complain if someone has a few cycles to push these, they look pretty good to me but I won't be able to work on PG stuff again until tomorrow night at the earliest. 

Thanks!

Stephen

Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Michael Paquier
Date:
On Sat, Aug 5, 2017 at 5:27 AM, Stephen Frost <sfrost@snowman.net> wrote:
> Unfortunately the day got away from me due to some personal... adventures
> (having to do with lack of air conditioning first and then lack of gas,
> amongst a lot of other things going on right now...). I just got things back
> online but, well, my day tomorrow is pretty well packed solid.  I wouldn't
> complain if someone has a few cycles to push these, they look pretty good to
> me but I won't be able to work on PG stuff again until tomorrow night at the
> earliest.

If no other committer wants to take a shot at those patches, it may be
better to push them after the next minor release happens? I don't like
delaying bug fixes, but the release is close by and time flies.
-- 
Michael



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Robert Haas
Date:
On Sat, Aug 5, 2017 at 4:14 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Sat, Aug 5, 2017 at 5:27 AM, Stephen Frost <sfrost@snowman.net> wrote:
>> Unfortunately the day got away from me due to some personal... adventures
>> (having to do with lack of air conditioning first and then lack of gas,
>> amongst a lot of other things going on right now...). I just got things back
>> online but, well, my day tomorrow is pretty well packed solid.  I wouldn't
>> complain if someone has a few cycles to push these, they look pretty good to
>> me but I won't be able to work on PG stuff again until tomorrow night at the
>> earliest.
>
> If no other committer wants to take a shot at those patches, it may be
> better to push them after the next minor release happens? I don't like
> delaying bug fixes, but the release is close by and time flies.

/me reads patches.

If we apply these patches to 9.6, then pg_stop_backup() on a standby
will start writing backup history files and removing no-longer-needed
backup history files.  That's a clear behavior change, and it isn't a
bug fix.  Making the waitforarchive option work is a bug fix, but I'm
not convinced we should take it as a license to change other aspects
of the behavior in a point release.

-                                errhint("WAL control functions cannot
be executed during recovery.")));
+                                errhint("WAL control functions cannot
be executed during recovery; "
+                                                "they should be
executed on the primary instead")));

I don't agree with this change at all.  Executing WAL control
functions on the master cannot reasonably be considered equivalent to
doing it on the standby.  This is like saying "don't take candy from a
baby, take it from an adult instead".  That could be good advice under
the right set of circumstances, but it's also subject to unfortunate
misinterpretation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Aug 5, 2017 at 4:14 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> If no other committer wants to take a shot at those patches, it may be
>> better to push them after the next minor release happens? I don't like
>> delaying bug fixes, but the release is close by and time flies.

> /me reads patches.
> [ assorted objections ]

Given Robert's objections, there's no way we should force these into
Monday's releases.  Safer to spend whatever time is needed to get it
right.
        regards, tom lane



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Robert Haas
Date:
On Sat, Aug 5, 2017 at 9:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> If we apply these patches to 9.6, then pg_stop_backup() on a standby
> will start writing backup history files and removing no-longer-needed
> backup history files.  That's a clear behavior change, and it isn't a
> bug fix.  Making the waitforarchive option work is a bug fix, but I'm
> not convinced we should take it as a license to change other aspects
> of the behavior in a point release.

After refreshing my memory further, I take it back.  pg_stop_backup()
doesn't even have a second argument on v9.6, so back-porting this fix
to 9.6 is a meaningless thing; there's nothing to fix.  What the
patches propose to do there instead is adopt the behavior proposed for
v10 when pg_stop_backup()'s second argument is true as the
unconditional behavior in v9.6.  For that to be the right thing to do,
we have to decide that the current v9.6 behavior is a bug.  But I
(still) think that's very arguable, because the whole point is that
we've just finished adding a flag in v10 to disable on the master the
*very same behavior* we're proposing to mandate on the standby in
v9.6.  How can we say that v9.6's current behavior is a bug when v10
produces the same behavior with wait_for_archive = false?  That just
doesn't make any sense.

I've pushed a minimal fix for v10 which should address Sawada-san's
original complaint: now, if you say wait_for_archive = true on a
standby, you'll get a warning if it didn't wait, or if archive_mode =
always, it will wait.  I think the right thing to do about 9.6 is
document the behavior; there's no problem here that a user can't work
around by doing it right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Michael Paquier
Date:
On Sat, Aug 5, 2017 at 4:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Aug 5, 2017 at 9:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> If we apply these patches to 9.6, then pg_stop_backup() on a standby
>> will start writing backup history files and removing no-longer-needed
>> backup history files.  That's a clear behavior change, and it isn't a
>> bug fix.  Making the waitforarchive option work is a bug fix, but I'm
>> not convinced we should take it as a license to change other aspects
>> of the behavior in a point release.
>
> After refreshing my memory further, I take it back.  pg_stop_backup()
> doesn't even have a second argument on v9.6, so back-porting this fix
> to 9.6 is a meaningless thing; there's nothing to fix.  What the
> patches propose to do there instead is adopt the behavior proposed for
> v10 when pg_stop_backup()'s second argument is true as the
> unconditional behavior in v9.6.  For that to be the right thing to do,
> we have to decide that the current v9.6 behavior is a bug.  But I
> (still) think that's very arguable, because the whole point is that
> we've just finished adding a flag in v10 to disable on the master the
> *very same behavior* we're proposing to mandate on the standby in
> v9.6.  How can we say that v9.6's current behavior is a bug when v10
> produces the same behavior with wait_for_archive = false?  That just
> doesn't make any sense.

Because default values should be safe in the backup and restore area,
and wait_for_archive = false is not the default. I would like to point
out that the 9.6 behavior has been discussed as being a bug upthread
for 9.6 by three people (David, Sawada-san and I) as there is a real
risk to take inconsistent backups from standbys (a WAL segment may not
be archived when pg_stop_backup reports back, so the user may not have
all the WAL it would need to get back to a consistent state), and that
the default should be to get consistent and safe backups.

> I've pushed a minimal fix for v10 which should address Sawada-san's
> original complaint: now, if you say wait_for_archive = true on a
> standby, you'll get a warning if it didn't wait, or if archive_mode =
> always, it will wait.

Backup history files are around mainly for debugging purposes. While I
don't mind about the choice to not generate them on back-branches, the
inconsistency between primary and standbys in generating them is
really disturbing. We could have taken the occasion to address that
here as this is not invasive, but well... I do complain a lot about
keeping changes going to v10 non-invasive if possible. So no real
complain to do what's been done.

> I think the right thing to do about 9.6 is
> document the behavior; there's no problem here that a user can't work
> around by doing it right.

There are many ways for users to do it wrong in this area, that I am
of the opinion to give them safe defaults if we have a way to make
things work safe in the backend. And here we are talking about extra
checks to make sure that a WAL segment is correctly archived... I have
seen bugs lately in custom backup code which led to inconsistent
backups.
-- 
Michael



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Robert Haas
Date:
On Sat, Aug 5, 2017 at 12:07 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Because default values should be safe in the backup and restore area,
> and wait_for_archive = false is not the default.

Neither is archive_mode = always, without which wait_for_archive =
true doesn't actually wait.

> I would like to point
> out that the 9.6 behavior has been discussed as being a bug upthread
> for 9.6 by three people (David, Sawada-san and I) as there is a real
> risk to take inconsistent backups from standbys (a WAL segment may not
> be archived when pg_stop_backup reports back, so the user may not have
> all the WAL it would need to get back to a consistent state), and that
> the default should be to get consistent and safe backups.

Sure, but that only happens if you haven't archived the very next WAL
segment yet, which in many environments isn't going to be a problem.
Furthermore, there's also the opposite danger of somebody's backup
script hanging where it currently doesn't.  I think it's just wrong to
say that without this change you don't get consistent backups.  It
just means you have an additional step to do to make sure that you
have all the WAL files you need - which is also true *with* the patch,
because you still have to make sure a WAL rotation happens on the
master.

Besides, if not waiting is so bad, then what about the fact that 9.5,
9.4, 9.3, and 9.2 have the exact same code to not wait, and you're not
proposing to change that?  What about the fact waiting isn't even
well-defined unless standbys support archiving?

> Backup history files are around mainly for debugging purposes. While I
> don't mind about the choice to not generate them on back-branches, the
> inconsistency between primary and standbys in generating them is
> really disturbing. We could have taken the occasion to address that
> here as this is not invasive, but well... I do complain a lot about
> keeping changes going to v10 non-invasive if possible. So no real
> complain to do what's been done.

I'm sorry you're disturbed, but I think that's clearly not a separate
change and not a bug fix.  The current behavior is well-documented,
including in places your patch didn't change, like the pg_basebackup
documentation.

>> I think the right thing to do about 9.6 is
>> document the behavior; there's no problem here that a user can't work
>> around by doing it right.
>
> There are many ways for users to do it wrong in this area, that I am
> of the opinion to give them safe defaults if we have a way to make
> things work safe in the backend. And here we are talking about extra
> checks to make sure that a WAL segment is correctly archived... I have
> seen bugs lately in custom backup code which led to inconsistent
> backups.

I agree that there are many ways for users to do it wrong, and I do
not think that changing critical behavior in minor releases will
result in a reduction in the number of ways for users to do it wrong.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Paul A Jungwirth
Date:
On Sat, Aug 5, 2017 at 7:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> After refreshing my memory further, I take it back.  pg_stop_backup()
> doesn't even have a second argument on v9.6, so back-porting this fix
> to 9.6 is a meaningless thing; there's nothing to fix.

According to the docs at
https://www.postgresql.org/docs/9.6/static/functions-admin.html there
is a one-arg version in 9.6.

I've been watching this thread since I noticed it a few days ago. I
have a system on 9.6 with replication, where the standby uses
`archive_mode: always` so that I can run WAL-E backups from there
instead of from the master. I'm a little worried about my backup
integrity now (It sounds like this might be a "it usually works"
situation). Also, WAL-E currently uses the no-arg pg_stop_backup, but
I was contemplating offering a patch to use non-exclusive backups when
available, both to avoid stopping the standby and also to generate a
tablespace description (which WAL-E requires when restoring with
tablespaces). Just thought I'd offer a real use-case for the
non-exclusive-mode functions on 9.6.

I don't have an opinion on the urgency of back-porting a fix, but if
pg_stop_backup(boolean) allows for inconsistent backups, it does sound
like a problem on 9.6 too.

Paul



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Paul A Jungwirth
Date:
On Sat, Aug 5, 2017 at 1:28 PM, Paul A Jungwirth
<pj@illuminatedcomputing.com> wrote:
> On Sat, Aug 5, 2017 at 7:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> After refreshing my memory further, I take it back.  pg_stop_backup()
>> doesn't even have a second argument on v9.6, so back-porting this fix
>> to 9.6 is a meaningless thing; there's nothing to fix.
>
> According to the docs at
> https://www.postgresql.org/docs/9.6/static/functions-admin.html there
> is a one-arg version in 9.6.

I'm sorry, I just realized you said a *second* argument. Apologies for
the distraction!

Paul



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Robert Haas
Date:
On Sat, Aug 5, 2017 at 4:28 PM, Paul A Jungwirth
<pj@illuminatedcomputing.com> wrote:
> I don't have an opinion on the urgency of back-porting a fix, but if
> pg_stop_backup(boolean) allows for inconsistent backups, it does sound
> like a problem on 9.6 too.

It doesn't.  The talk about inconsistent backups is, I think, not a
very precise way of speaking.  All backups taken by copying the data
directory are inconsistent until you run recovery for long enough to
make them consistent.  What we're talking about is whether it's
guaranteed that all WAL segments needed to reach consistency will be
archived by the time pg_stop_backup() returns.  However, even if
they're not, it doesn't mean that there's anything wrong with your
backup per se; it just means you need to replay WAL files that were
not in the archive at the time pg_stop_backup() completed when
restoring it.

So, for example, if you keep an archive of all of your WAL files for
PITR purposes, you're fine.  IIUC, to have a problem, you have to do
the following:

1. Take a backup from the standby, not the master.
2. Take a backup using pg_start_backup() and pg_stop_backup(), not
pg_basebackup.
3. Instead of keeping all of your WAL files, just keep the absolute
minimum number required to restore the backup.
4. Instead of keeping the WAL files through the LSN returned by
pg_stop_backup(), only keep the ones that were archived before
pg_stop_backup() returned.

All of (1)-(3) are legitimate user choices, although not everyone will
make them.  (4) is unfortunately the procedure recommended by our
documentation, which is where the problem comes in.  I think it's
pretty lame that the documentation recommends ignoring the return
value of pg_stop_backup(); ISTM that it would be very wise to
recommend cross-checking the return value against the WAL files you're
keeping for the backup.  Even if we thought the waiting logic was
working perfectly in every case, having a cross-check on something as
important as backup integrity seems like an awfully good idea.

For example, *even if* we were to apply the patch Michael Paquier
proposed, it doesn't actually do anything except emit a warning when
archive_mode isn't set to always, and that warning could easily be
missed by an automated backup script, and archive_mode=always is
probably not a common setting.  So you still have the same problem in
most cases.  I think the root of this problem is that commit
7117685461af50f50c03f43e6a622284c8d54694 did only a pretty perfunctory
update of the documentation, as the commit message itself admits:
   Only reference documentation is included. The main section on
backup still needs   to be rewritten to cover this, but since that is already scheduled
for a separate   large rewrite, it's not included in this patch.

But it doesn't look like that ever really got done.
https://www.postgresql.org/docs/10/static/continuous-archiving.html#backup-lowlevel-base-backup
really doesn't talk at all about standbys or differences in required
procedures between masters and standbys.  It makes statements that are
flagrantly riduculous in the case of a standby, like claiming that
pg_start_backup() always performs a checkpoint and that pg_stop_backup
"terminates the backup mode and performs an automatic switch to the
next WAL segment."  Well, obviously not.

And at least to me, that's the real bug here.  Somebody needs to go
through and fix this documentation properly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Paul A Jungwirth
Date:
On Sat, Aug 5, 2017 at 6:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Aug 5, 2017 at 4:28 PM, Paul A Jungwirth
> <pj@illuminatedcomputing.com> wrote:
>> I don't have an opinion on the urgency of back-porting a fix, but if
>> pg_stop_backup(boolean) allows for inconsistent backups, it does sound
>> like a problem on 9.6 too.
>
> It doesn't.  The talk about inconsistent backups is, I think, not a
> very precise way of speaking.

Thank you for the reassurance and the detailed explanation!

Paul



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

From
Michael Paquier
Date:
On Sun, Aug 6, 2017 at 3:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> All of (1)-(3) are legitimate user choices, although not everyone will
> make them.  (4) is unfortunately the procedure recommended by our
> documentation, which is where the problem comes in.  I think it's
> pretty lame that the documentation recommends ignoring the return
> value of pg_stop_backup(); ISTM that it would be very wise to
> recommend cross-checking the return value against the WAL files you're
> keeping for the backup.  Even if we thought the waiting logic was
> working perfectly in every case, having a cross-check on something as
> important as backup integrity seems like an awfully good idea.

I would got a little bit more far and say that this is mandatory as
the minimum recovery point that needs to be reached is the LSN
returned by pg_stop_backup(). For backups taken from primaries, this
is a larger deal because XLOG_BACKUP_END may not be present if the
last WAL segment switched is not in the archive. For backups taken
from standbys, the thing is more tricky as the control file should be
backed up last. I would think that the best thing we could do is to
extend pg_stop_backup a bit more so as it returns the control file to
write in the backup using a bytea to hold the data for example.

> I think the root of this problem is that commit
> 7117685461af50f50c03f43e6a622284c8d54694 did only a pretty perfunctory
> update of the documentation, as the commit message itself admits:
>
>     Only reference documentation is included. The main section on
> backup still needs
>     to be rewritten to cover this, but since that is already scheduled
> for a separate
>     large rewrite, it's not included in this patch.
>
> But it doesn't look like that ever really got done.
> https://www.postgresql.org/docs/10/static/continuous-archiving.html#backup-lowlevel-base-backup
> really doesn't talk at all about standbys or differences in required
> procedures between masters and standbys.  It makes statements that are
> flagrantly riduculous in the case of a standby, like claiming that
> pg_start_backup() always performs a checkpoint and that pg_stop_backup
> "terminates the backup mode and performs an automatic switch to the
> next WAL segment."  Well, obviously not.

Yep.

> And at least to me, that's the real bug here.  Somebody needs to go
> through and fix this documentation properly.

So, Magnus? :)
-- 
Michael



Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standbyserver

From
Stephen Frost
Date:
Robert, Michael,

* Michael Paquier (michael.paquier@gmail.com) wrote:
> On Sun, Aug 6, 2017 at 3:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > All of (1)-(3) are legitimate user choices, although not everyone will
> > make them.  (4) is unfortunately the procedure recommended by our
> > documentation, which is where the problem comes in.  I think it's
> > pretty lame that the documentation recommends ignoring the return
> > value of pg_stop_backup(); ISTM that it would be very wise to
> > recommend cross-checking the return value against the WAL files you're
> > keeping for the backup.  Even if we thought the waiting logic was
> > working perfectly in every case, having a cross-check on something as
> > important as backup integrity seems like an awfully good idea.
>
> I would got a little bit more far and say that this is mandatory as
> the minimum recovery point that needs to be reached is the LSN
> returned by pg_stop_backup(). For backups taken from primaries, this
> is a larger deal because XLOG_BACKUP_END may not be present if the
> last WAL segment switched is not in the archive. For backups taken
> from standbys, the thing is more tricky as the control file should be
> backed up last. I would think that the best thing we could do is to
> extend pg_stop_backup a bit more so as it returns the control file to
> write in the backup using a bytea to hold the data for example.

Indeed, getting this all correct isn't trivial and it's really
unfortunate that our documentation in this area is really lacking.
Further, having things not actually work as the documentation claims
isn't exactly helping.

> > I think the root of this problem is that commit
> > 7117685461af50f50c03f43e6a622284c8d54694 did only a pretty perfunctory
> > update of the documentation, as the commit message itself admits:
> >
> >     Only reference documentation is included. The main section on
> > backup still needs
> >     to be rewritten to cover this, but since that is already scheduled
> > for a separate
> >     large rewrite, it's not included in this patch.
> >
> > But it doesn't look like that ever really got done.
> > https://www.postgresql.org/docs/10/static/continuous-archiving.html#backup-lowlevel-base-backup
> > really doesn't talk at all about standbys or differences in required
> > procedures between masters and standbys.  It makes statements that are
> > flagrantly riduculous in the case of a standby, like claiming that
> > pg_start_backup() always performs a checkpoint and that pg_stop_backup
> > "terminates the backup mode and performs an automatic switch to the
> > next WAL segment."  Well, obviously not.
>
> Yep.

Indeed, that was something that I had also heard was being worked on,
but it's really unfortunate that it didn't happen.

> > And at least to me, that's the real bug here.  Somebody needs to go
> > through and fix this documentation properly.
>
> So, Magnus? :)

I continue to be off the opinion that rewriting the documentation in
this case to match what we actually do is pretty unfriendly to users who
have built tools using the documentation at the time.  Evidently, that
ship has sailed, however, but we didn't help things here by only
changing how PG10 works and not also at least updating the documentation
to be accurate.

I've poked Magnus... more directly regarding this and offered to have
someone help with the development of better documentation in this area.
Hopefully we can get the docs in the back-branches fixed for the next
round of minor releases, and call out those updates in the release
notes, at least.

Thanks!

Stephen