Thread: For what should pg_stop_backup wait?

For what should pg_stop_backup wait?

From
"Fujii Masao"
Date:
Hi

In HEAD, pg_stop_backup waits until the history file has been archived.
But, in order to ensure that the last wal file was archived, pg_stop_backup
should wait until not only the history file but also the backup stopping wal
file has been archived, I think.

Because the alphabetic order of the history file and the backup stopping
wal file is not constant.

If the backup starting wal file is the same as the stopping one (that is,
any wal files were not switched during backup), the history file whose
name consists of the starting one is behind the stopping one.
Otherwise, the backup stopping wal file is behind the history file.

Is this worth fixing?

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: For what should pg_stop_backup wait?

From
Simon Riggs
Date:
On Thu, 2008-08-07 at 21:11 +0900, Fujii Masao wrote:

> In HEAD, pg_stop_backup waits until the history file has been archived.
> But, in order to ensure that the last wal file was archived, pg_stop_backup
> should wait until not only the history file but also the backup stopping wal
> file has been archived, I think.
> 
> Because the alphabetic order of the history file and the backup stopping
> wal file is not constant.

> If the backup starting wal file is the same as the stopping one (that is,
> any wal files were not switched during backup), the history file whose
> name consists of the starting one is behind the stopping one.
> Otherwise, the backup stopping wal file is behind the history file.

pg_stop_backup explicitly requests a log switch prior to writing the
history file, so this occurs in all cases. The assumption is wrong.

So yes, we should wait for stopxlogfilename, not histfilepath.

I'll do a patch. Thanks for your input.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: For what should pg_stop_backup wait?

From
Simon Riggs
Date:
On Thu, 2008-08-07 at 14:59 +0100, Simon Riggs wrote:

> I'll do a patch. Thanks for your input.

Please review attached patch.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Attachment

Re: For what should pg_stop_backup wait?

From
"Fujii Masao"
Date:
On Thu, Aug 7, 2008 at 11:34 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On Thu, 2008-08-07 at 14:59 +0100, Simon Riggs wrote:
>
>> I'll do a patch. Thanks for your input.
>
> Please review attached patch.

Thank you for your patch!

But, there are two problems in this patch, I think.

> !      * Wait until the last WAL file has been archived. We assume that the
> !      * alphabetic sorting property of the WAL files ensures the history
> !      * file is guaranteed archived by the time the last WAL file is archived.
> !      * The history file name depends upon the startpoint, whereas the last
> !      * file depends upon the stoppoint. They are always different because we
> !      * make an explicit xlog switch earlier in this function.

If there are a few transactions during backup, startpoint may be the same as
the stoppoint.
   postgres=# SELECT pg_xlogfile_name(pg_start_backup('test')) AS startpoint;           startpoint
--------------------------   000000010000000000000004   (1 row)
 
   *** A few transaction occurs ***
   postgres=# SELECT pg_xlogfile_name(pg_stop_backup()) AS stoppoint;           stoppoint   --------------------------
 000000010000000000000004   (1 row)
 

In this situation, the history file (000000010000000000000004.00000020.backup)
is behind the stoppoint (000000010000000000000004) in the alphabetic order.
So, pg_stop_backup should wait for both the stoppoint and the history
file, I think.


> !     while (!XLogArchiveCheckDone(stopxlogfilename, false))

If a concurrent checkpoint removes the status file before XLogArchiveCheckDone,
pg_stop_backup continues waiting forever. This is undesirable behavior.
Yes, statement_timeout may help. But, I don't want to use it, because the
*successful* backup is canceled.

How about checking whether the stoppoint was archived by comparing with
the last WAL archived. The archiver process can tell the last WAL archived.
Or, we can calculate it from the status file.

On the other hand, pg_stop_backup doesn't continue waiting for the history file
forever. Because, only pg_stop_backup removes the status file of it,
and a concurrent
pg_stop_backup never happen.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: For what should pg_stop_backup wait?

From
Simon Riggs
Date:
On Fri, 2008-08-08 at 11:47 +0900, Fujii Masao wrote:
> On Thu, Aug 7, 2008 at 11:34 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >

> In this situation, the history file (000000010000000000000004.00000020.backup)
> is behind the stoppoint (000000010000000000000004) in the alphabetic order.
> So, pg_stop_backup should wait for both the stoppoint and the history
> file, I think.

OK, I see that now.

> 
> > !     while (!XLogArchiveCheckDone(stopxlogfilename, false))
> 
> If a concurrent checkpoint removes the status file before XLogArchiveCheckDone,
> pg_stop_backup continues waiting forever. This is undesirable behavior.

I think it will only get removed by the second checkpoint, not the
first. So the risk of that happening seems almost certainly impossible.
But we'll put in a check just in case.

> Yes, statement_timeout may help. But, I don't want to use it, because the
> *successful* backup is canceled.
> 
> How about checking whether the stoppoint was archived by comparing with
> the last WAL archived. The archiver process can tell the last WAL archived.
> Or, we can calculate it from the status file.

I think its easier to test whether the stopxlogfilename still exists in
pg_xlog. If not, we know it has been archived away. We can add that as
an extra condition inside the loop.

So thinking we should test XLogArchiveCheckDone() for both
stopxlogfilename and history file and then stat for the stop WAL file:


BackupHistoryFileName(histfilepath, ThisTimeLineID, _logId, _logSeg,                      startpoint.xrecoff %
XLogSegSize);
seconds_before_warning = 60;waits = 0;
while (!XLogArchiveCheckDone(histfilepath, false) ||        !XLogArchiveCheckDone(stopxlogfilename, false)){    struct
statstat_buf;    char        xlogpath[MAXPGPATH];
 
    /*     * Check to see if file has already been archived and WAL file     * removed by a concurrent checkpoint
*/   snprintf(xlogpath, MAXPGPATH, XLOGDIR "/%s", stopxlogfilename);    if (XLogArchiveCheckDone(histfilepath, false)
&&       stat(xlogpath, &stat_buf) != 0)        break;
 
    CHECK_FOR_INTERRUPTS();
    pg_usleep(1000000L);
    if (++waits >= seconds_before_warning)    {        seconds_before_warning *= 2;     /* This wraps in >10 years...
*/       elog(WARNING, "pg_stop_backup() waiting for archive to complete "                         "(%d seconds
delay)",waits);    }}
 


-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: For what should pg_stop_backup wait?

From
Simon Riggs
Date:
On Fri, 2008-08-08 at 12:57 +0100, Simon Riggs wrote:

> > Yes, statement_timeout may help. But, I don't want to use it, because the
> > *successful* backup is canceled.
> >
> > How about checking whether the stoppoint was archived by comparing with
> > the last WAL archived. The archiver process can tell the last WAL archived.
> > Or, we can calculate it from the status file.
>
> I think its easier to test whether the stopxlogfilename still exists in
> pg_xlog. If not, we know it has been archived away. We can add that as
> an extra condition inside the loop.
>
> So thinking we should test XLogArchiveCheckDone() for both
> stopxlogfilename and history file and then stat for the stop WAL file:

This seems better.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Attachment

Re: For what should pg_stop_backup wait?

From
Simon Riggs
Date:
On Fri, 2008-08-08 at 11:47 +0900, Fujii Masao wrote:
> On Thu, Aug 7, 2008 at 11:34 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >
> > On Thu, 2008-08-07 at 14:59 +0100, Simon Riggs wrote:
> >
> >> I'll do a patch. Thanks for your input.
> >
> > Please review attached patch.
> 
> Thank you for your patch!

Would you mind if I asked you to be the official reviewer of this patch
for the latest CommitFest? It would help greatly, since you understand
the issue and clearly the code also. Thanks, if possible.
http://wiki.postgresql.org/wiki/CommitFest:2008-09

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: For what should pg_stop_backup wait?

From
"Fujii Masao"
Date:
On Fri, Sep 5, 2008 at 7:32 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> Would you mind if I asked you to be the official reviewer of this patch
> for the latest CommitFest? It would help greatly, since you understand
> the issue and clearly the code also. Thanks, if possible.

OK. I'll try to review your patch.
Do I need to inform Josh of reviewing it in advance?

regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: For what should pg_stop_backup wait?

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
>> So thinking we should test XLogArchiveCheckDone() for both
>> stopxlogfilename and history file and then stat for the stop WAL file:

> This seems better.

Somehow I missed the 5-Apr patch that introduced this bogosity.
Next time you make a fundamental change in the behavior of a function,
how about updating its comment to match?

After studying it for a minute, I think that XLogArchiveCheckDone no
longer even has a clearly explainable purpose; it needs to be split
into two functions with two different behaviors.  I plan to revert
XLogArchiveCheckDone to its previous behavior and introduce
XLogArchiveIsBusy (flipping the sense of the result) to use in
pg_stop_backup.
        regards, tom lane


Re: For what should pg_stop_backup wait?

From
Simon Riggs
Date:
On Mon, 2008-09-08 at 11:57 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> >> So thinking we should test XLogArchiveCheckDone() for both
> >> stopxlogfilename and history file and then stat for the stop WAL file:
> 
> > This seems better.
> 
> Somehow I missed the 5-Apr patch that introduced this bogosity.
> Next time you make a fundamental change in the behavior of a function,
> how about updating its comment to match?

It felt OK when I did it, because of the clearly named boolean. But I
accept your point and will look to improve my code comments in future.

> After studying it for a minute, I think that XLogArchiveCheckDone no
> longer even has a clearly explainable purpose; it needs to be split
> into two functions with two different behaviors.  I plan to revert
> XLogArchiveCheckDone to its previous behavior and introduce
> XLogArchiveIsBusy (flipping the sense of the result) to use in
> pg_stop_backup.

You sound like you're in the middle of doing this yourself. Or would you
like me to do that?

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: For what should pg_stop_backup wait?

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> You sound like you're in the middle of doing this yourself. Or would you
> like me to do that?

Yeah, done and committed.
        regards, tom lane