Thread: For what should pg_stop_backup wait?
Hi In HEAD, pg_stop_backup waits until the history file has been archived. But, in order to ensure that the last wal file was archived, pg_stop_backup should wait until not only the history file but also the backup stopping wal file has been archived, I think. Because the alphabetic order of the history file and the backup stopping wal file is not constant. If the backup starting wal file is the same as the stopping one (that is, any wal files were not switched during backup), the history file whose name consists of the starting one is behind the stopping one. Otherwise, the backup stopping wal file is behind the history file. Is this worth fixing? -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, 2008-08-07 at 21:11 +0900, Fujii Masao wrote: > In HEAD, pg_stop_backup waits until the history file has been archived. > But, in order to ensure that the last wal file was archived, pg_stop_backup > should wait until not only the history file but also the backup stopping wal > file has been archived, I think. > > Because the alphabetic order of the history file and the backup stopping > wal file is not constant. > If the backup starting wal file is the same as the stopping one (that is, > any wal files were not switched during backup), the history file whose > name consists of the starting one is behind the stopping one. > Otherwise, the backup stopping wal file is behind the history file. pg_stop_backup explicitly requests a log switch prior to writing the history file, so this occurs in all cases. The assumption is wrong. So yes, we should wait for stopxlogfilename, not histfilepath. I'll do a patch. Thanks for your input. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Thu, 2008-08-07 at 14:59 +0100, Simon Riggs wrote: > I'll do a patch. Thanks for your input. Please review attached patch. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Attachment
On Thu, Aug 7, 2008 at 11:34 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Thu, 2008-08-07 at 14:59 +0100, Simon Riggs wrote: > >> I'll do a patch. Thanks for your input. > > Please review attached patch. Thank you for your patch! But, there are two problems in this patch, I think. > ! * Wait until the last WAL file has been archived. We assume that the > ! * alphabetic sorting property of the WAL files ensures the history > ! * file is guaranteed archived by the time the last WAL file is archived. > ! * The history file name depends upon the startpoint, whereas the last > ! * file depends upon the stoppoint. They are always different because we > ! * make an explicit xlog switch earlier in this function. If there are a few transactions during backup, startpoint may be the same as the stoppoint. postgres=# SELECT pg_xlogfile_name(pg_start_backup('test')) AS startpoint; startpoint -------------------------- 000000010000000000000004 (1 row) *** A few transaction occurs *** postgres=# SELECT pg_xlogfile_name(pg_stop_backup()) AS stoppoint; stoppoint -------------------------- 000000010000000000000004 (1 row) In this situation, the history file (000000010000000000000004.00000020.backup) is behind the stoppoint (000000010000000000000004) in the alphabetic order. So, pg_stop_backup should wait for both the stoppoint and the history file, I think. > ! while (!XLogArchiveCheckDone(stopxlogfilename, false)) If a concurrent checkpoint removes the status file before XLogArchiveCheckDone, pg_stop_backup continues waiting forever. This is undesirable behavior. Yes, statement_timeout may help. But, I don't want to use it, because the *successful* backup is canceled. How about checking whether the stoppoint was archived by comparing with the last WAL archived. The archiver process can tell the last WAL archived. Or, we can calculate it from the status file. On the other hand, pg_stop_backup doesn't continue waiting for the history file forever. Because, only pg_stop_backup removes the status file of it, and a concurrent pg_stop_backup never happen. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, 2008-08-08 at 11:47 +0900, Fujii Masao wrote: > On Thu, Aug 7, 2008 at 11:34 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > > In this situation, the history file (000000010000000000000004.00000020.backup) > is behind the stoppoint (000000010000000000000004) in the alphabetic order. > So, pg_stop_backup should wait for both the stoppoint and the history > file, I think. OK, I see that now. > > > ! while (!XLogArchiveCheckDone(stopxlogfilename, false)) > > If a concurrent checkpoint removes the status file before XLogArchiveCheckDone, > pg_stop_backup continues waiting forever. This is undesirable behavior. I think it will only get removed by the second checkpoint, not the first. So the risk of that happening seems almost certainly impossible. But we'll put in a check just in case. > Yes, statement_timeout may help. But, I don't want to use it, because the > *successful* backup is canceled. > > How about checking whether the stoppoint was archived by comparing with > the last WAL archived. The archiver process can tell the last WAL archived. > Or, we can calculate it from the status file. I think its easier to test whether the stopxlogfilename still exists in pg_xlog. If not, we know it has been archived away. We can add that as an extra condition inside the loop. So thinking we should test XLogArchiveCheckDone() for both stopxlogfilename and history file and then stat for the stop WAL file: BackupHistoryFileName(histfilepath, ThisTimeLineID, _logId, _logSeg, startpoint.xrecoff % XLogSegSize); seconds_before_warning = 60;waits = 0; while (!XLogArchiveCheckDone(histfilepath, false) || !XLogArchiveCheckDone(stopxlogfilename, false)){ struct statstat_buf; char xlogpath[MAXPGPATH]; /* * Check to see if file has already been archived and WAL file * removed by a concurrent checkpoint */ snprintf(xlogpath, MAXPGPATH, XLOGDIR "/%s", stopxlogfilename); if (XLogArchiveCheckDone(histfilepath, false) && stat(xlogpath, &stat_buf) != 0) break; CHECK_FOR_INTERRUPTS(); pg_usleep(1000000L); if (++waits >= seconds_before_warning) { seconds_before_warning *= 2; /* This wraps in >10 years... */ elog(WARNING, "pg_stop_backup() waiting for archive to complete " "(%d seconds delay)",waits); }} -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Fri, 2008-08-08 at 12:57 +0100, Simon Riggs wrote: > > Yes, statement_timeout may help. But, I don't want to use it, because the > > *successful* backup is canceled. > > > > How about checking whether the stoppoint was archived by comparing with > > the last WAL archived. The archiver process can tell the last WAL archived. > > Or, we can calculate it from the status file. > > I think its easier to test whether the stopxlogfilename still exists in > pg_xlog. If not, we know it has been archived away. We can add that as > an extra condition inside the loop. > > So thinking we should test XLogArchiveCheckDone() for both > stopxlogfilename and history file and then stat for the stop WAL file: This seems better. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Attachment
On Fri, 2008-08-08 at 11:47 +0900, Fujii Masao wrote: > On Thu, Aug 7, 2008 at 11:34 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > > > On Thu, 2008-08-07 at 14:59 +0100, Simon Riggs wrote: > > > >> I'll do a patch. Thanks for your input. > > > > Please review attached patch. > > Thank you for your patch! Would you mind if I asked you to be the official reviewer of this patch for the latest CommitFest? It would help greatly, since you understand the issue and clearly the code also. Thanks, if possible. http://wiki.postgresql.org/wiki/CommitFest:2008-09 -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Fri, Sep 5, 2008 at 7:32 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > Would you mind if I asked you to be the official reviewer of this patch > for the latest CommitFest? It would help greatly, since you understand > the issue and clearly the code also. Thanks, if possible. OK. I'll try to review your patch. Do I need to inform Josh of reviewing it in advance? regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Simon Riggs <simon@2ndquadrant.com> writes: >> So thinking we should test XLogArchiveCheckDone() for both >> stopxlogfilename and history file and then stat for the stop WAL file: > This seems better. Somehow I missed the 5-Apr patch that introduced this bogosity. Next time you make a fundamental change in the behavior of a function, how about updating its comment to match? After studying it for a minute, I think that XLogArchiveCheckDone no longer even has a clearly explainable purpose; it needs to be split into two functions with two different behaviors. I plan to revert XLogArchiveCheckDone to its previous behavior and introduce XLogArchiveIsBusy (flipping the sense of the result) to use in pg_stop_backup. regards, tom lane
On Mon, 2008-09-08 at 11:57 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > >> So thinking we should test XLogArchiveCheckDone() for both > >> stopxlogfilename and history file and then stat for the stop WAL file: > > > This seems better. > > Somehow I missed the 5-Apr patch that introduced this bogosity. > Next time you make a fundamental change in the behavior of a function, > how about updating its comment to match? It felt OK when I did it, because of the clearly named boolean. But I accept your point and will look to improve my code comments in future. > After studying it for a minute, I think that XLogArchiveCheckDone no > longer even has a clearly explainable purpose; it needs to be split > into two functions with two different behaviors. I plan to revert > XLogArchiveCheckDone to its previous behavior and introduce > XLogArchiveIsBusy (flipping the sense of the result) to use in > pg_stop_backup. You sound like you're in the middle of doing this yourself. Or would you like me to do that? -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes: > You sound like you're in the middle of doing this yourself. Or would you > like me to do that? Yeah, done and committed. regards, tom lane