Thread: pg_stop_backup does not complete
Simon, Fujii, All:

While demoing HS/SR at SCALE, I ran into a problem which is likely to be a commonly encountered bug when people first set up HS/SR. Here's the sequence:

1) Set up a brand new master with an archive_command and archive_mode=on.
2) Start the master.
3) Do a pg_start_backup().
4) Realize, based on log error messages, that I've misconfigured the archive_command.
5) Attempt to shut down the master. Master tells me that pg_stop_backup must be run in order to shut down.
6) Execute pg_stop_backup.
7) pg_stop_backup waits forever without ever stopping backup. Every 60 seconds, it gives me a helpful "still waiting" message, but at least in the amount of time I was willing to wait (5 minutes), it never completed.
8) Do an immediate shutdown, as it's the only way I can get the database unstuck.

With some experimentation, the problem seems to occur when you have a failing archive_command and a master which currently has no database traffic; for example, if I did some database write activity (a createdb) then pg_stop_backup would complete after about 60 seconds (which, btw, is extremely annoying, but at least tolerable).

This issue is 100% reproducible.

--Josh Berkus
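The waiting behaviour in step 7 can be pictured with a small toy model. This is only a sketch under stated assumptions, not PostgreSQL source: `simulate_stop_backup`, `archive_attempt`, `give_up_after` and `warn_every` are hypothetical names, time is simulated rather than measured, and the 60-second warning interval and 5-minute patience limit are taken from the report above.

```python
def simulate_stop_backup(archive_attempt, give_up_after=300, warn_every=60):
    """Toy model: poll once per simulated second until archiving succeeds.

    archive_attempt is a callable returning True once the required segment
    has been archived. give_up_after stands in for the tester losing
    patience; the behaviour reported above has no such limit.
    Returns (completed, elapsed_seconds, warnings).
    """
    warnings = []
    for elapsed in range(1, give_up_after + 1):
        if archive_attempt():
            return True, elapsed, warnings
        if elapsed % warn_every == 0:
            # Mirrors the periodic "still waiting" notice from the report.
            warnings.append(f"still waiting ({elapsed} seconds elapsed)")
    return False, give_up_after, warnings


if __name__ == "__main__":
    # An archive_command that always fails never lets the wait finish.
    completed, elapsed, notes = simulate_stop_backup(lambda: False)
    print(completed, elapsed, len(notes))
```

With an `archive_attempt` that always returns False, the model never completes and only accumulates warnings, which is the shape of the problem described in steps 6 to 8.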
> This issue is 100% reproduceable. Oh, btw, this is on Alpha4. --Josh Berkus
On Tue, 2010-02-23 at 09:45 -0800, Josh Berkus wrote: > Simon, Fujii, All: > > While demoing HS/SR at SCALE, I ran into a problem which is likely to be > a commonly encountered bug when people first setup HS/SR. Here's the > sequence: > > 1) Set up a brand new master with an archive-commmand and archive=on. > > 2) Start the master > > 3) Do a pg_start_backup() > > 4) Realize, based on log error messages, that I've misconfigured the > archive_command. > > 5) Attempt to shut down the master. Master tells me that pg_stop_backup > must be run in order to shut down. If I issue a shutdown, PostgreSQL should do whatever it needs to do to shutdown; including issuing a pg_stop_backup. Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering Respect is earned, not gained through arbitrary and repetitive use or Mr. or Sir.
"Joshua D. Drake" <jd@commandprompt.com> wrote: > If I issue a shutdown, PostgreSQL should do whatever it needs to > do to shutdown; including issuing a pg_stop_backup. Should we have a pg_fail_backup function, so that it doesn't put out a file which suggests that we have a complete backup? -Kevin
On Tue, 2010-02-23 at 09:45 -0800, Josh Berkus wrote:
> 1) Set up a brand new master with an archive-command and archive=on.
>
> 2) Start the master
>
> 3) Do a pg_start_backup()
>
> 4) Realize, based on log error messages, that I've misconfigured the archive_command.
>
> 5) Attempt to shut down the master. Master tells me that pg_stop_backup must be run in order to shut down.
>
> 6) Execute pg_stop_backup.
>
> 7) pg_stop_backup waits forever without ever stopping backup. Every 60 seconds, it gives me a helpful "still waiting" message, but at least in the amount of time I was willing to wait (5 minutes), it never completed.
>
> 8) do an immediate shutdown, as it's the only way I can get the database unstuck.
>
> With some experimentation, the problem seems to occur when you have a failing archive_command and a master which currently has no database traffic; for example, if I did some database write activity (a createdb) then pg_stop_backup would complete after about 60 seconds (which, btw, is extremely annoying, but at least tolerable).
>
> This issue is 100% reproducible.

IMHO there is no problem in that behaviour. If somebody requests a backup then we should wait for it to complete. Kevin's suggestion of pg_fail_backup() is the only sensible conclusion there because it gives an explicit way out of deadlock.

ISTM the problem is that you didn't test. Steps 3 and 4 should have been reversed. Perhaps we should put something in the docs to say "and test". The correct resolution is to put in an archive_command that works.

We can put in an extra step to prevent a pg_start_backup() if there are a significant number of outstanding files to be archived. Doing that seems like closing the door after the horse has bolted, since we just introduced streaming replication that doesn't rely on archived files. In any case, I don't see many people working on a production system hitting a problem on an archive_command and then deciding to shut down.

So I don't see this as something that needs fixing for 9.0. There is already too much non-essential code there, all of which needs to be tested. I don't think adding in new corner cases to "help" people makes any sense until we have automated testing that allows us to rerun the regression tests to check all this stuff still works.

-- Simon Riggs www.2ndQuadrant.com
On Tue, 2010-02-23 at 18:58 +0000, Simon Riggs wrote: > On Tue, 2010-02-23 at 09:45 -0800, Josh Berkus wrote: > > This issue is 100% reproduceable. > > IMHO there in no problem in that behaviour. If somebody requests a > backup then we should wait for it to complete. Kevin's suggestion of > pg_fail_backup() is the only sensible conclusion there because it gives > an explicit way out of deadlock. > > ISTM the problem is that you didn't test. Steps 3 and 4 should have been > reversed. Perhaps we should put something in the docs to say "and test". > The correct resolution is to put in an archive_command that works. The problem isn't that it is a bad archive_command, it is that PostgreSQL has no way to deal with this gracefully. Yes people should test but are we dealing with the real world or not? > > So I don't see this as something that needs fixing for 9.0. There is > already too much non-essential code there, all of which needs to be > tested. I don't think adding in new corner cases to "help" people makes > any sense until we have automated testing that allows us to rerun the > regression tests to check all this stuff still works. This will bite us if we release like this. Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering Respect is earned, not gained through arbitrary and repetitive use or Mr. or Sir.
Simon Riggs <simon@2ndQuadrant.com> wrote:
> The correct resolution is to put in an archive_command that works.

One really should ensure that WAL files (or should I now say data? ;-) are flowing before running the pg_start_backup() function. The documentation has always been pretty explicit about that:

http://www.postgresql.org/docs/8.4/interactive/continuous-archiving.html

| 24.3.2. Making a Base Backup
|
| The procedure for making a base backup is relatively simple:
|
| 1. Ensure that WAL archiving is enabled and working.
|
| 2. Connect to the database as a superuser, and issue the command:
|
|    SELECT pg_start_backup('label');
| ...

As long as the SR documentation is equally explicit on this point, you'd have to be blatantly going against the instructions to hit this. Which makes me think that while pg_fail_backup() might actually be a good idea, it's not really needed to solve this, so it's 9.1 material at best.

-Kevin
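One way to act on "ensure that WAL archiving is enabled and working" before calling pg_start_backup() is to dry-run the candidate archive_command against a throwaway file and check its exit status. The sketch below assumes a Unix-like shell; the helper names `expand_archive_command` and `archive_command_works` are hypothetical, the segment-style file name is made up, and the only PostgreSQL facts relied on are that archive_command uses the %p (path) and %f (file name) placeholders and signals success with a zero exit status.

```python
import os
import subprocess
import tempfile


def expand_archive_command(template, path, filename):
    """Substitute %p, %f and %% in an archive_command-style template."""
    out = []
    i = 0
    while i < len(template):
        pair = template[i:i + 2]
        if pair == "%p":
            out.append(path)
            i += 2
        elif pair == "%f":
            out.append(filename)
            i += 2
        elif pair == "%%":
            out.append("%")
            i += 2
        else:
            out.append(template[i])
            i += 1
    return "".join(out)


def archive_command_works(template):
    """Run the command once on a dummy file; True only if it exits with 0."""
    with tempfile.TemporaryDirectory() as tmp:
        filename = "000000010000000000000001"  # made-up, segment-style name
        # Assumption: an absolute path is passed here for simplicity.
        path = os.path.join(tmp, filename)
        with open(path, "wb") as fh:
            fh.write(b"dummy WAL contents")
        command = expand_archive_command(template, path, filename)
        result = subprocess.run(command, shell=True)
        return result.returncode == 0
```

For example, `archive_command_works('cp "%p" "/some/archive/dir/%f"')` would report whether the copy succeeds for one file. A passing dry run does not prove the archiver is healthy under load, only that the command is not trivially broken.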
On Tue, Feb 23, 2010 at 06:58:22PM +0000, Simon Riggs wrote: > On Tue, 2010-02-23 at 09:45 -0800, Josh Berkus wrote: > > > 1) Set up a brand new master with an archive-commmand and > > archive=on. > > > > 2) Start the master > > > > 3) Do a pg_start_backup() > > > > 4) Realize, based on log error messages, that I've misconfigured > > the archive_command. > > > 5) Attempt to shut down the master. Master tells me that > > pg_stop_backup must be run in order to shut down. > > > > 6) Execute pg_stop_backup. > > > > 7) pg_stop_backup waits forever without ever stopping backup. > > Ever 60 seconds, it give me a helpful "still waiting" message, but > > at least in the amount of time I was willing to wait (5 minutes), > > it never completed. > > > > 8) do an immediate shutdown, as it's the only way I can get the > > database unstuck. > > > > With some experimentation, the problem seems to occur when you > > have a failing archive_command and a master which currently has no > > database traffic; for example, if I did some database write > > activity (a createdb) then pg_stop_backup would complete after > > about 60 seconds (which, btw, is extremely annoying, but at least > > tolerable). > > > > This issue is 100% reproduceable. > > IMHO there in no problem in that behaviour. If somebody requests a > backup then we should wait for it to complete. Kevin's suggestion of > pg_fail_backup() is the only sensible conclusion there because it > gives an explicit way out of deadlock. > > ISTM the problem is that you didn't test. Steps 3 and 4 should have > been reversed. Perhaps we should put something in the docs to say > "and test". The correct resolution is to put in an archive_command > that works. +1 for clarifying and extending the docs. Cheers, David. 
-- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Tue, 2010-02-23 at 11:24 -0800, Joshua D. Drake wrote:
> This will bite us if we release like this.

No it won't. The current behaviour was put there by user request a few releases back. This isn't a 9.0 issue, and as I've said it's addressing something that we no longer see as mainstream going forwards.

There are plenty of things that will bite us, but not this.

-- Simon Riggs www.2ndQuadrant.com
On Tue, Feb 23, 2010 at 12:52 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > On Tue, 2010-02-23 at 09:45 -0800, Josh Berkus wrote: >> Simon, Fujii, All: >> >> While demoing HS/SR at SCALE, I ran into a problem which is likely to be >> a commonly encountered bug when people first setup HS/SR. Here's the >> sequence: >> >> 1) Set up a brand new master with an archive-commmand and archive=on. >> >> 2) Start the master >> >> 3) Do a pg_start_backup() >> >> 4) Realize, based on log error messages, that I've misconfigured the >> archive_command. >> >> 5) Attempt to shut down the master. Master tells me that pg_stop_backup >> must be run in order to shut down. > > If I issue a shutdown, PostgreSQL should do whatever it needs to do to > shutdown; including issuing a pg_stop_backup. Maybe. But for sure, if it doesn't, and instead tells the user to issue pg_stop_backup(), then pg_stop_backup() had better WORK when the user tries to execute it. I gather that the problem is that it has to finish all that outstanding archiving before shutting down, in which case Kevin's suggestion of having a command to abort the backup seems reasonable. I might call it pg_abort_backup() rather than pg_fail_backup(), but... ...Robert
On Tue, 2010-02-23 at 14:49 -0500, Robert Haas wrote:
> > If I issue a shutdown, PostgreSQL should do whatever it needs to do to shutdown; including issuing a pg_stop_backup.
>
> Maybe. But for sure, if it doesn't, and instead tells the user to issue pg_stop_backup(), then pg_stop_backup() had better WORK when the user tries to execute it.

Right.

> I gather that the problem is that it has to finish all that outstanding archiving before shutting down, in which case Kevin's suggestion of having a command to abort the backup seems reasonable. I might call it pg_abort_backup() rather than pg_fail_backup(), but...

But...?

Joshua D. Drake
On Tue, Feb 23, 2010 at 3:09 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > On Tue, 2010-02-23 at 14:49 -0500, Robert Haas wrote: > >> > If I issue a shutdown, PostgreSQL should do whatever it needs to do to >> > shutdown; including issuing a pg_stop_backup. >> >> Maybe. But for sure, if it doesn't, and instead tells the user to >> issue pg_stop_backup(), then pg_stop_backup() had better WORK when the >> user tries to execute it. > > Right. > >> I gather that the problem is that it has to >> finish all that outstanding archiving before shutting down, in which >> case Kevin's suggestion of having a command to abort the backup seems >> reasonable. I might call it pg_abort_backup() rather than >> pg_fail_backup(), but... >> > > But...? But it seems like a good idea other than that. ...Robert
On Wed, Feb 24, 2010 at 4:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Maybe. But for sure, if it doesn't, and instead tells the user to issue pg_stop_backup(), then pg_stop_backup() had better WORK when the user tries to execute it. I gather that the problem is that it has to finish all that outstanding archiving before shutting down, in which case Kevin's suggestion of having a command to abort the backup seems reasonable. I might call it pg_abort_backup() rather than pg_fail_backup(), but...

Or how about adding a new boolean parameter to pg_stop_backup() that specifies whether it should wait for WAL archiving?

pg_stop_backup([wait boolean])

This parameter is optional. If true (default), it waits for archiving.

In warm-standby and SR, we don't need to wait for archiving before starting the standby from the base backup. So pg_stop_backup(false) would be useful for speeding up the setup of log shipping.

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
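The proposed optional flag can be illustrated with a toy model. This is a sketch of the idea only, not PostgreSQL code: `ToyBackupSession`, `archive_succeeds` and `max_polls` are hypothetical, and the bounded polling loop is a stand-in so the example terminates (the behaviour discussed in this thread has no such bound).

```python
class ToyBackupSession:
    """Toy model of a backup session with an optional wait flag."""

    def __init__(self, archive_succeeds):
        # Whether the (imaginary) archive_command ever succeeds.
        self.archive_succeeds = archive_succeeds
        self.in_backup = False

    def start_backup(self):
        self.in_backup = True

    def stop_backup(self, wait=True, max_polls=5):
        """Return 'archived', 'not_waited' or 'timed_out'."""
        if not self.in_backup:
            raise RuntimeError("a backup is not in progress")
        # Modelling choice: backup mode ends regardless of the outcome.
        self.in_backup = False
        if not wait:
            # Corresponds to the proposed pg_stop_backup(false).
            return "not_waited"
        for _ in range(max_polls):
            if self.archive_succeeds:
                return "archived"
        return "timed_out"
```

In this model a session with a failing archive step returns immediately when called with `wait=False`, while the default `wait=True` keeps polling until the stand-in limit is reached.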
On 2/23/10 10:58 AM, Simon Riggs wrote: > So I don't see this as something that needs fixing for 9.0. There is > already too much non-essential code there, all of which needs to be > tested. I don't think adding in new corner cases to "help" people makes > any sense until we have automated testing that allows us to rerun the > regression tests to check all this stuff still works. So, you're going to personally field the roughly 10,000 bug reports we get on pgsql-general about this behaviour? 24/7? The fact that we ran into this issue on the *first* day of testing the new alpha4 is indicative of how common it will be -- it is not a corner case, it is a common setup error which will affect probably 20% of new users who try 9.0. And new users are going to panic when they can't shut down postgresql, not just test for issues. Any situation where postgresql cannot be safely shut down because of a common setup mistake (typoing an archive_command) is, IMNSHO, not something we can release with. --Josh Berkus
On Tue, 2010-02-23 at 17:46 -0800, Josh Berkus wrote:
> On 2/23/10 10:58 AM, Simon Riggs wrote:
> > So I don't see this as something that needs fixing for 9.0. There is already too much non-essential code there, all of which needs to be tested. I don't think adding in new corner cases to "help" people makes any sense until we have automated testing that allows us to rerun the regression tests to check all this stuff still works.
>
> So, you're going to personally field the roughly 10,000 bug reports we get on pgsql-general about this behaviour? 24/7?
>
> The fact that we ran into this issue on the *first* day of testing the new alpha4 is indicative of how common it will be -- it is not a corner case, it is a common setup error which will affect probably 20% of new users who try 9.0. And new users are going to panic when they can't shut down postgresql, not just test for issues.
>
> Any situation where postgresql cannot be safely shut down because of a common setup mistake (typoing an archive_command) is, IMNSHO, not something we can release with.

It's not a common setup mistake. Nothing changed in this release and this has never been reported before.

The behaviour to wait for pg_stop_backup() was added by user request. The behaviour for shutdown to wait for pg_stop_backup() was also added by user request.

Your mistake was not typoing an archive_command, it was not correctly testing that what you had done was actually working. The fix is to read the manual and correct the typo. Shutting down the server after failing to configure it is not likely to be a normal reaction to experiencing an error in configuration. Better docs might help you, but I doubt it.

ISTM you should collect test reports, then analyse and prioritise them. This rates pretty low for me: low severity, low frequency.

(If new users panic when they can't shut down the server, they probably won't like smart shutdown very much either.)

-- Simon Riggs www.2ndQuadrant.com
Simon,

> It's not a common setup mistake. Nothing changed in this release and this has never been reported before.
>
> The behaviour to wait for pg_stop_backup() was added by user request. The behaviour for shutdown to wait for pg_stop_backup() was also added by user request.

Your two statements above contradict each other.

And, while it makes sense for smart shutdown to wait for pg_stop_backup(), it does not make sense for fast shutdown to wait.

Aside from that, the main issue is not having shutdown wait for pg_stop_backup; it's pg_stop_backup never completing. An issue, I'll note, you're ignoring. If you're going to be this defensive whenever anyone reports a bug, it's going to be veeeeeeery slow going to troubleshoot HS.

As Robert Haas said: "But for sure, if it doesn't, and instead tells the user to issue pg_stop_backup(), then pg_stop_backup() had better WORK when the user tries to execute it."

> Your mistake was not typoing an archive_command, it was not correctly testing that what you had done was actually working. The fix is to read the manual and correct the typo. Shutting down the server after failing to configure it is not likely to be a normal reaction to experiencing an error in configuration.

The problem is you're thinking of an experienced PostgreSQL DBA doing setup on a production server. That's not what I'm talking about. I'm talking about the thousands of new users who are going to try PostgreSQL for the first time because of HS/SR on a test installation. If they encounter this issue, they will decide (again) that PostgreSQL is too hard to use and give up on us for another 5 years.

We've spent the last few years overcoming the image of PostgreSQL being too complicated for most people to use. You seem hell-bent on restoring it. Given the timing, our project has one chance to establish a new reputation as the SQL database for everybody. User-hostile behavior like this will ruin that chance.

Saying "RTFM and test, you newbie!" is not a valid response, and that's what your "you should have read the docs" amounts to. Heck, I *did* read the docs.

> ISTM you should collect test reports, then analyse and prioritise them. This rates pretty low for me: low severity, low frequency.

To date, I, Robert Haas, Joe Conway, Josh Drake, and the members of LAPUG all find this highly problematic behavior. So consider it 6 problem reports, not just one.

--Josh Berkus
On Wed, 2010-02-24 at 10:07 -0800, Josh Berkus wrote:
> Simon,
> > Your mistake was not typoing an archive_command, it was not correctly testing that what you had done was actually working. The fix is to read the manual and correct the typo. Shutting down the server after failing to configure it is not likely to be a normal reaction to experiencing an error in configuration.
>
> The problem is you're thinking of an experienced PostgreSQL DBA doing setup on a production server. That's not what I'm talking about. I'm talking about the thousands of new users who are going to try PostgreSQL for the first time because of HS/SR on a test installation. If they encounter this issue, they will decide (again) that PostgreSQL is too hard to use and give up on us for another 5 years.

Shoot, forget the "new users", I am thinking about the hundreds of thousands of existing NOT DBA users. E.g., 90% of our user base.

> Saying "RTFM and test, you newbie!" is not a valid response, and that's what your "you should have read the docs" amounts to. Heck, I *did* read the docs.

Agreed. Although RTFM is important, we shouldn't have RTFM for something that is clearly a user-visible behavior mistake on our part.

> > ISTM you should collect test reports, then analyse and prioritise them. This rates pretty low for me: low severity, low frequency.
>
> To date, I, Robert Haas, Joe Conway, Josh Drake, and the members of LAPUG all find this highly problematic behavior. So consider it 6 problem reports, not just one.

Basically the reports boil down to people who are actually going to be dealing with this in the field. Simon, with respect, you have been 6 feet deep in code for too long on this. You need to step back and take some constructive feedback from those that are dealing with the production issues and do so with a smile.

Sincerely,

Joshua D. Drake
On Wed, Feb 24, 2010 at 1:07 PM, Josh Berkus <josh@agliodbs.com> wrote:
> And, while it makes sense for smart shutdown to wait for pg_stop_backup(), it does not make sense for fast shutdown to wait.

TFM in fact says:

http://www.postgresql.org/docs/8.4/static/app-pg-ctl.html#APP-PG-CTL-DESCRIPTION

In stop mode, the server that is running in the specified data directory is shut down. Three different shutdown methods can be selected with the -m option:

"Smart" mode waits for online backup mode to finish and all the clients to disconnect. This is the default.

"Fast" mode does not wait for clients to disconnect and will terminate an online backup in progress. All active transactions are rolled back and clients are forcibly disconnected, then the server is shut down.

"Immediate" mode will abort all server processes without a clean shutdown. This will lead to a recovery run on restart.

Your OP was not too clear about whether you tried a smart shutdown or a fast shutdown, but if you meant a fast shutdown, this is apparently (he says without testing) a regression.

...Robert
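The three modes quoted above can be summarised as a small lookup table, together with the matching pg_ctl argument list. This is an illustrative sketch: `ShutdownMode`, `describe` and `pg_ctl_args` are hypothetical names, the table restates only what the quoted pg_ctl text says, and the data directory is a placeholder supplied by the caller.

```python
from enum import Enum


class ShutdownMode(Enum):
    SMART = "smart"
    FAST = "fast"
    IMMEDIATE = "immediate"


# Summary of the quoted pg_ctl description, one row per mode.
_BEHAVIOUR = {
    ShutdownMode.SMART: {
        "waits_for_backup": True,
        "waits_for_clients": True,
        "clean_shutdown": True,
    },
    ShutdownMode.FAST: {
        "waits_for_backup": False,   # terminates an online backup in progress
        "waits_for_clients": False,  # clients are forcibly disconnected
        "clean_shutdown": True,
    },
    ShutdownMode.IMMEDIATE: {
        "waits_for_backup": False,
        "waits_for_clients": False,
        "clean_shutdown": False,     # leads to a recovery run on restart
    },
}


def describe(mode):
    """Return the documented behaviour for one shutdown mode."""
    return dict(_BEHAVIOUR[mode])


def pg_ctl_args(mode, datadir):
    """Build the argument list for 'pg_ctl stop' in the given mode."""
    return ["pg_ctl", "stop", "-D", datadir, "-m", mode.value]
```

Read against this table, a fast shutdown that waits for the backup to finish would contradict the documented behaviour, which is the point Robert raises.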
> Your OP was not too clear about whether you tried a smart shutdown or > a fast shutdown, but if you meant a fast shutdown, this is apparently > (he says without testing) a regression. Ah, sorry. Yes, I attempted a fast shutdown. --Josh Berkus
Josh Berkus wrote: > And, while it makes sense for smart shutdown to wait for > pg_stop_backup(), it does not make sense for fast shutdown to wait. Hang on, fast shutdown does *not* wait for backup to finish. > Aside from that, the main issue is not having shutdown wait for > pg_stop_backup; it's pg_stop_backup never completing. An issue, I'll > note, you're ignoring. Ahh, that's a detail I missed too. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2010-02-24 at 10:07 -0800, Josh Berkus wrote: > > > > The behaviour to wait for pg_stop_backup() was added by user request. > > The behaviour for shutdown to wait for pg_stop_backup() was also added > > by user request. > > Your two statements above contradict each other. No they don't. > And, while it makes sense for smart shutdown to wait for > pg_stop_backup(), it does not make sense for fast shutdown to wait. > > Aside from that, the main issue is not having shutdown wait for > pg_stop_backup; it's pg_stop_backup never completing. An issue, I'll > note, you're ignoring. If you're going to be this defensive whenever > anyone reports a bug, it's going to be veeeeeeery slow going to > troubleshoot HS. I haven't ignored the issue. The behaviour you are complaining about was put there following complaints that it didn't wait. You're ignoring the point that there hasn't been any change in this release and so your comments are unfounded in reality. > To date, I, Robert Haas, Joe Conway, Josh Drake, and the members of > LAPUG all find this highly problematic behavior. So consider it 6 > problem reports, not just one. :-) I'm told that ignoring user groups is OK... If you're going to address single issues rather than prioritise what is important over what is not, you will get strange responses. -- Simon Riggs www.2ndQuadrant.com
On Wed, 2010-02-24 at 10:17 -0800, Joshua D. Drake wrote: > Basically the reports boil down to people who are actually going to be > dealing with this in the field. Simon with respect you have been 6 feet > deep in code for too long on this. You need to step back and take some > constructive feedback from those that are dealing with the production > issues and do so with a smile. I receive constructive feedback all the time from the many users I deal personally and directly with each week. You make the mistake of assuming that someone that can develop has no solution experience. That is exactly how I fund further development, so you are off base by a long way. The way this works currently is based on production feedback. This post is about non-production usage. Until someone comes up with a truly constructive suggestion that takes account of the issues that cause the current design, it won't get traction with me. -- Simon Riggs www.2ndQuadrant.com
On 2/24/10 10:40 AM, Heikki Linnakangas wrote:
> Josh Berkus wrote:
>> And, while it makes sense for smart shutdown to wait for pg_stop_backup(), it does not make sense for fast shutdown to wait.
>
> Hang on, fast shutdown does *not* wait for backup to finish.

It did when I tried it. I'll test to see what combination of factors produces that.

>> Aside from that, the main issue is not having shutdown wait for pg_stop_backup; it's pg_stop_backup never completing. An issue, I'll note, you're ignoring.
>
> Ahh, that's a detail I missed too.

Yeah, that's the important one. I went through the sequence:

1) Try to shut down.
2) be told to run pg_stop_backup()
3) run pg_stop_backup()
4) pg_stop_backup never completes.

Look at the original bug report on this thread; it has the details. I think it's still the issue that if no logs are being written (database is idle) pg_stop_backup does not complete, which I thought we fixed, but maybe not?

--Josh Berkus
> I haven't ignored the issue. The behaviour you are complaining about was > put there following complaints that it didn't wait. You're ignoring the > point that there hasn't been any change in this release and so your > comments are unfounded in reality. I've posted a reproduceable bug (pg_stop_backup never terminating). Either say that you tried to reproduce it and failed, or accept that it exists. Saying "that bug is impossible" is the denial of reality. To reiterate yet again, the problem is that pg_stop_backup never completes. What we do on shutdown is a side issue. --Josh Berkus
On Wed, 2010-02-24 at 11:07 -0800, Josh Berkus wrote: > > I haven't ignored the issue. The behaviour you are complaining about was > > put there following complaints that it didn't wait. You're ignoring the > > point that there hasn't been any change in this release and so your > > comments are unfounded in reality. > > I've posted a reproduceable bug (pg_stop_backup never terminating). > Either say that you tried to reproduce it and failed, or accept that it > exists. Saying "that bug is impossible" is the denial of reality. You haven't posted a reproduceable bug, nor is this new to 9.0. You have just noticed a production feature that was specifically put there by user request. The feature exists, has done for some time now and it's acting as it should. This is about what happens in production, not your laptop. The required behaviour in-production is to assume that the sysadmin has configured it correctly and we wait for them to fix the problem. The previous complaints were from people who felt they wanted to avoid invalid backups. Personally, I'd say there were many issues that are new to 9.0 that really are important, and that this isn't one of them. -- Simon Riggs www.2ndQuadrant.com
> You haven't posted a reproduceable bug, nor is this new to 9.0.

Yes, I have.

1) set up a failing archive_command on an idle database
2) do pg_start_backup
3) do pg_stop_backup
4) pg_stop_backup waits forever (or at least 5 minutes, which is as long as I've given it so far).

> This is about what happens in production, not your laptop. The required behaviour in-production is to assume that the sysadmin has configured it correctly and we wait for them to fix the problem.

90% of our user base does not have a sysadmin. Or, for that matter, even a professional DBA.

> The previous complaints were from people who felt they wanted to avoid invalid backups.

People don't deploy PostgreSQL in production in the first place if it has this kind of "no good option from here" failure when they first try it. HS/SR is for use by new users of PostgreSQL as well as the experienced.

--Josh Berkus
On Wed, 2010-02-24 at 19:02 +0000, Simon Riggs wrote: > On Wed, 2010-02-24 at 10:17 -0800, Joshua D. Drake wrote: > You make the mistake of assuming that someone that can develop has no > solution experience. That is exactly how I fund further development, so > you are off base by a long way. I never implied that. I implied that your perspective is currently skewed. I stand by that implication. Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering Respect is earned, not gained through arbitrary and repetitive use or Mr. or Sir.
On Wed, 2010-02-24 at 11:31 -0800, Josh Berkus wrote: > > This is about what happens in production, not your laptop. The required > > behaviour in-production is to assume that the sysadmin has configured it > > correctly and we wait for them to fix the problem. > > 90% of our user base does not have a sysadmin. Or, for that matter, > even a professional DBA. Your logic is terrible. If there is no sysadmin, who would be typing the pg_stop_backup() ? Who would have misconfigured it in the first place? If you have a concrete proposal, get off your soapbox and make one, based upon the technical information you've received. There are clear reasons why things are the way they are and those reasons will not be ignored, by me. -- Simon Riggs www.2ndQuadrant.com
Simon, > If you have a concrete proposal, get off your soapbox and make one, > based upon the technical information you've received. There are clear > reasons why things are the way they are and those reasons will not be > ignored, by me. OK, can you go through the reasons why pg_stop_backup would not complete? And why it's a problem to have it complete? I'll admit to not understanding them; it seems to me that pg_stop_backup should just immediately force a checkpoint and a log write, but you're obviously trying to prevent something with the current behavior. What are you trying to prevent? --Josh Berkus
On 2/24/10 11:55 AM, Joshua D. Drake wrote: > On Wed, 2010-02-24 at 19:02 +0000, Simon Riggs wrote: >> On Wed, 2010-02-24 at 10:17 -0800, Joshua D. Drake wrote: > >> You make the mistake of assuming that someone that can develop has no >> solution experience. That is exactly how I fund further development, so >> you are off base by a long way. > > I never implied that. I implied that your perspective is currently > skewed. I stand by that implication. Let's kill the ad-hominem attacks guys. Not productive. Thanks. --Josh Berkus
Josh Berkus wrote: > OK, can you go through the reasons why pg_stop_backup would not > complete? pg_stop_backup() doesn't complete until all the WAL segments needed to restore from the backup are archived. If archive_command is failing, that never happens. > And why it's a problem to have it complete? Because then you would conclude that the backup is finished and you have all the data you need to restore safely in the archive. If archive_command is failing, that's not happening. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
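The wait condition Heikki describes can be modelled in a few lines. The Python sketch below is only a toy model, not PostgreSQL code: `wait_for_archive`, the segment names, and the `max_polls` cap are hypothetical names introduced here (the real function waits with no such cap), and archive_command is stood in for by a callable that returns True on success.

```python
from typing import Callable, List


def wait_for_archive(required: List[str],
                     archive_command: Callable[[str], bool],
                     max_polls: int) -> bool:
    """Toy model: return True once every required WAL segment is archived.

    max_polls is a hypothetical cap so that the example terminates.
    """
    pending = list(required)
    for _ in range(max_polls):
        # A segment leaves the pending list only when the command succeeds.
        pending = [seg for seg in pending if not archive_command(seg)]
        if not pending:
            return True
    return False  # still waiting: the archive does not yet hold a usable backup


def always_fails(seg: str) -> bool:
    # Stands in for a misconfigured archive_command.
    return False


def always_works(seg: str) -> bool:
    # Stands in for a working archive_command.
    return True


stuck = wait_for_archive(["seg1", "seg2"], always_fails, max_polls=5)
done = wait_for_archive(["seg1", "seg2"], always_works, max_polls=5)
```

With a command that always fails, the pending list never shrinks, which is the "waits forever" behaviour reported at the top of the thread.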
> pg_stop_backup() doesn't complete until all the WAL segments needed to > restore from the backup are archived. If archive_command is failing, > that never happens. OK, so we need a way out of that cycle if the user is issuing pg_stop_backup because they *already know* that archive_command is failing. Right now, there's no way out other than a fast shutdown, which is a bit user-hostile. --Josh Berkus
On Wed, 2010-02-24 at 12:32 -0800, Josh Berkus wrote: > > pg_stop_backup() doesn't complete until all the WAL segments needed to > > restore from the backup are archived. If archive_command is failing, > > that never happens. > > OK, so we need a way out of that cycle if the user is issuing > pg_stop_backup because they *already know* that archive_command is > failing. Right now, there's no way out other than a fast shutdown, > which is a bit user-hostile. Hmmm well... changing the archive_command to /bin/true and issuing a HUP would cause the command to succeed, but I still think that is over the top. I prefer Kevin's solution or some variant thereof: http://archives.postgresql.org/pgsql-hackers/2010-02/msg01853.php http://archives.postgresql.org/pgsql-hackers/2010-02/msg01907.php Sincerely, Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering Respect is earned, not gained through arbitrary and repetitive use or Mr. or Sir.
Josh Berkus wrote: >> pg_stop_backup() doesn't complete until all the WAL segments needed to >> restore from the backup are archived. If archive_command is failing, >> that never happens. > > OK, so we need a way out of that cycle if the user is issuing > pg_stop_backup because they *already know* that archive_command is > failing. Right now, there's no way out other than a fast shutdown, Sure there is. Just kill the session, Ctrl-c or similar. pg_stop_backup() isn't actually doing anything at that point anymore; it's just waiting for the files to be archived before returning. Or fix archive_command, and pg_reload_conf(). BTW, if you want a timeout for that, you can use statement_timeout. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Josh Berkus <josh@agliodbs.com> wrote: >> pg_stop_backup() doesn't complete until all the WAL segments >> needed to restore from the backup are archived. If >> archive_command is failing, that never happens. > > OK, so we need a way out of that cycle if the user is issuing > pg_stop_backup because they *already know* that archive_command is > failing. Right now, there's no way out other than a fast > shutdown, which is a bit user-hostile. So maybe pg_abort_backup() is needed for 9.0 after all? (1) You'd want to be able to run it either instead of pg_stop_backup or to interrupt a pending one. (2) You wouldn't want the .backup file to be written. (3) What about the equivalent WAL end-of-backup record? -Kevin
On Wed, 2010-02-24 at 11:55 -0800, Joshua D. Drake wrote: > On Wed, 2010-02-24 at 19:02 +0000, Simon Riggs wrote: > > On Wed, 2010-02-24 at 10:17 -0800, Joshua D. Drake wrote: > > > You make the mistake of assuming that someone that can develop has no > > solution experience. That is exactly how I fund further development, so > > you are off base by a long way. > > I never implied that. I implied that your perspective is currently > skewed. I stand by that implication. My perspective comes from knowing the code AND having production experience with PostgreSQL many times over. -- Simon Riggs www.2ndQuadrant.com
Josh Berkus <josh@agliodbs.com> writes: >> pg_stop_backup() doesn't complete until all the WAL segments needed to >> restore from the backup are archived. If archive_command is failing, >> that never happens. > OK, so we need a way out of that cycle if the user is issuing > pg_stop_backup because they *already know* that archive_command is > failing. Right now, there's no way out other than a fast shutdown, > which is a bit user-hostile. The pg_abort_backup() operation previously proposed seems like the only workable compromise. Simon is quite right to not want pg_stop_backup() to behave in a way that could contribute to data loss; but on the other hand there needs to be some clear way to get the system out of that state at need. regards, tom lane
On Feb 24, 2010, at 12:47 PM, Tom Lane wrote: >> OK, so we need a way out of that cycle if the user is issuing >> pg_stop_backup because they *already know* that archive_command is >> failing. Right now, there's no way out other than a fast shutdown, >> which is a bit user-hostile. > > The pg_abort_backup() operation previously proposed seems like the only > workable compromise. Simon is quite right to not want pg_stop_backup() > to behave in a way that could contribute to data loss; but on the other > hand there needs to be some clear way to get the system out of that > state at need. +1 makes sense. David
Josh Berkus wrote: >> pg_stop_backup() doesn't complete until all the WAL segments needed to >> restore from the backup are archived. If archive_command is failing, >> that never happens. >> > > OK, so we need a way out of that cycle if the user is issuing > pg_stop_backup because they *already know* that archive_command is > failing. Right now, there's no way out other than a fast shutdown, > which is a bit user-hostile. >

gsmith=# select name,context from pg_settings where name like 'archive%';
      name       |  context
-----------------+------------
 archive_command | sighup
 archive_mode    | postmaster
 archive_timeout | sighup

I expect for your particular bad situation, you can replace the archive_command with a corrected one, use "pg_ctl reload" to send a SIGHUP to make that fix active, and escape from this. That's the only right way out of this situation. You can't just abort a backup someone has asked for just because archives are failing and allow the server to shutdown cleanly in this situation. That's the wrong thing to do for production setups; the last thing you want for a system with archiving issues is to be stopped normally if it's interfering with an explicit admin requested backup.

Not necessarily any reason that backup even needs to fail, and no reason for the server to get restarted in this situation at all. If the archive_command never returned false information, and in fact just returned a valid error code, all of the segments needed to make the backup consistent will be queued up waiting for the problem to be fixed. Put the fixed archive_command in place, and you're off and running again. If that's impossible, because the archive_command was really screwed up, we can just tell people to swap to an archive_command that just returns success, and let the queued up segments to be archived all get tossed away. That backup will be bad, they fix the archive_command, send SIGHUP, and start over with a new backup.
There's some doc patches that could guide how to handle this situation better for sure, but I don't see any code changes needed. Everything working as designed, optimized for production use at the expense of some confusion on how to recover if you configure things badly. I suggested a patch a few weeks ago to make "what is the archiver doing?" behavior easier to monitor, got the impression people felt it was redundant given SR was the preferred path moving forward and eventually this whole archive_command bit would be going away. I could revive that work if you feel this is such a bad issue that we need a better way to watch what the archiver is doing. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
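The two recovery paths Greg outlines (fix the command and let the queue drain, or swap in a command that only reports success and accept a bad backup) can be illustrated with a toy model. This is a sketch under stated assumptions, not the archiver's real logic: the `Archiver` class, its methods, and the segment names are hypothetical, and `reload` merely stands in for editing archive_command and sending SIGHUP.

```python
class Archiver:
    """Toy archiver: retries queued segments until the command succeeds."""

    def __init__(self, command):
        self.command = command   # callable(segment) -> bool
        self.queue = []          # segments still waiting to be archived

    def reload(self, command):
        # Stands in for fixing archive_command and sending SIGHUP.
        self.command = command

    def run_once(self):
        # Failed segments stay queued; successful ones are dropped.
        self.queue = [seg for seg in self.queue if not self.command(seg)]


archive = []                      # what actually reached the archive


def failing(seg):                 # like /bin/false: an honest failure
    return False


def fixed(seg):                   # corrected command: stores, then succeeds
    archive.append(seg)
    return True


def report_success_only(seg):     # like /bin/true: succeeds, stores nothing
    return True


# Path 1: the command fails honestly, is fixed, and the queue drains intact.
a = Archiver(failing)
a.queue = ["seg1", "seg2"]
a.run_once()
queued_while_failing = list(a.queue)
a.reload(fixed)
a.run_once()
recovered = (a.queue == [] and archive == ["seg1", "seg2"])

# Path 2: a success-only command empties the queue but archives nothing.
b = Archiver(failing)
b.queue = ["seg3"]
b.reload(report_success_only)
b.run_once()
discarded = (b.queue == [] and "seg3" not in archive)
```

In the first path nothing is lost while the command is failing, so the backup can still complete. In the second the wait ends, but the segment never reached the archive, which is why that backup has to be redone.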
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Josh Berkus wrote: >> OK, so we need a way out of that cycle if the user is issuing >> pg_stop_backup because they *already know* that archive_command is >> failing. Right now, there's no way out other than a fast shutdown, > Sure there is. Just kill the session, Ctrl-c or similar. > pg_stop_backup() isn't actually doing anything at that point anymore; > it's just waiting for the files to be archived before returning. One objection to this is that it's not very clear to the user when pg_stop_backup has finished with actual work and is just waiting for the archiver, ie when is it safe to hit control-C? Maybe we should emit a "backup done, waiting for archiver to complete" notice before entering the sleep loop. regards, tom lane
Greg, > I expect for your particular bad situation, you can replace the > archive_command with a corrected one, use "pg_ctl reload" to send a > SIGHUP to make that fix active, and escape from this. That's the only > right way out of this situation. You can't just abort a backup someone > has asked for just because archives are failing and allow the server to > shutdown cleanly in this situation. That's the wrong thing to do for > production setups; the last thing you want for a system with archiving > issues is to be stopped normally if it's interfering with an explicit > admin requested backup. Yeah, I can see that for large production setups with multiple staff. We also need something newbie-friendly (and friendly to the large number of users we have where the DBA/sysadmin is just the most skilled web developer) though. The above procedure is far too complex for someone who is "just trying out" PostgreSQL as a replacement for MySQL, and if recent conferences are anything to go by, we're about to have several thousand such users. BTW, please stop treating this issue as something which happens "only to Josh". I wouldn't be raising it if it weren't a natural circumstance which anyone who is trying PostgreSQL with HS/SR for the first time, with no experience with Warm Standby, would get into. Such new users are *likely* to get archive_command wrong, and likely to want to start over when they do. If we make that painful for them, they'll just switch to MySQL or CouchDB instead. Thing is, if archive_command is failing, then the backup is useless regardless until it's fixed. And sending the archives to /dev/null (the fix you're essentially recommending above) doesn't make the backup any more useful. So I'm seeing pg_abort_backup(), which also produces markers which prevent the backup from loading, as an improvement on the current UI. --Josh Berkus
Josh Berkus wrote: > So I'm seeing pg_abort_backup(), which also produces > markers which prevent the backup from loading, as an improvement on > current UI. Starting with 9.0, if recovery doesn't see an end-of-backup record, it will refuse to start up. In earlier versions we had a similar mechanism using the backup history files. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
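The 9.0 safeguard Heikki mentions amounts to a check over the replayed WAL stream. The Python sketch below is a simplified illustration only: the function name and the record labels (`"BACKUP_START"`, `"END_OF_BACKUP"`, `"INSERT"`) are hypothetical stand-ins, not the server's actual record types or code.

```python
def check_backup_complete(wal_records):
    """Toy model: succeed only if an end-of-backup record is present."""
    for record in wal_records:
        if record == "END_OF_BACKUP":   # hypothetical record label
            return True
    # No end-of-backup record was seen, so the backup cannot be trusted.
    raise RuntimeError("WAL ends before end of backup; refusing to start up")


complete = check_backup_complete(["BACKUP_START", "INSERT", "END_OF_BACKUP"])

try:
    check_backup_complete(["BACKUP_START", "INSERT"])
    refused = False
except RuntimeError:
    refused = True
```

A backup whose archive never received the segment holding the end-of-backup record fails this check, so it cannot be mistaken for a good one.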
Josh Berkus <josh@agliodbs.com> writes: > Thing is, if archive_command is failing, then the backup is useless > regardless until it's fixed. And sending the archives to /dev/null (the > fix you're essentially recommending above) doesn't make the backup any > more useful. So I'm seeing pg_abort_backup(), which also produces > markers which prevent the backup from loading, as an improvement on > current UI.

On reflection I'm not sure what pg_abort_backup would do for you. As Heikki points out, by the time the user has realized that pg_stop_backup() is not completing, it's *already done* all of the state changes it's going to make. There is no way to take the backup-complete WAL entry out of the WAL stream; it's already in there and there are probably ordinary entries after it by now. Having an oh-the-backup-failed-after-all entry somewhere downstream of that is entirely useless; the more so because by the time anything could *see* such an entry, the problem would have been resolved, since the problem is exactly not having gotten the WAL stream out to the archive. Before you could enter pg_abort_backup you'd have to control-C out of the pg_stop_backup call, and that action already accomplishes the only thing pg_abort_backup could do for you.

So what I am thinking is that this is really just a minor bit of user unfriendliness in pg_stop_backup. We should address it with one or both of these changes:

* emit a NOTICE as soon as pg_stop_backup's actual work is done and it's starting to wait for the archiver (or maybe after it's waited for a few seconds, but much less than the present 60).

* extend the existing WARNING (and the NOTICE too if we elect to have one) with a HINT message explicitly saying that you can cancel the wait but thus-and-such consequences might ensue.

Both of these things would only be helpful when using client software that shows you received notices promptly. psql is okay, but maybe pgAdmin and other tools would need some further work.
There is not much we can do about that in the core project though. regards, tom lane
On Wed, 2010-02-24 at 13:30 -0800, Josh Berkus wrote: > So I'm seeing pg_abort_backup(), which also produces > markers which prevent the backup from loading, as an improvement on > current UI. Since Kevin suggested this in his first post and I agreed with that in the first paragraph of my first post, I think you've wasted a lot of time here going in circles. 42 posts, more than a dozen people. I think we have better things to do than this small issue, which has nothing at all to do with a 9.0 feature. I think you should look at prioritisation. There are many things seriously in need of fixing and this wasn't one of them. Please test the following patch to see if it meets your needs and check the wordings used in the docs. -- Simon Riggs www.2ndQuadrant.com
Attachment
Simon Riggs <simon@2ndQuadrant.com> writes: > Please test the following patch to see if it meets your needs and check > the wordings used in the docs. What exactly will that function accomplish, given the assumption that the user already tried pg_stop_backup? regards, tom lane
Tom, Simon, > * emit a NOTICE as soon as pg_stop_backup's actual work is done and > it's starting to wait for the archiver (or maybe after it's waited > for a few seconds, but much less than the present 60). > > * extend the existing WARNING (and the NOTICE too if we elect to have > one) with a HINT message explicitly saying that you can cancel the > wait but thus-and-such consequences might ensue. > > Both of these things would only be helpful when using client software > that shows you received notices promptly. psql is okay, but maybe > pgAdmin and other tools would need some further work. There is not > much we can do about that in the core project though. Well, the client software could be fixed in time for 9.0, I'd think. I think that implementing both of the above would probably do the trick for user-friendliness, enough for 9.0. If it's obvious to the user on the console what to do, then they won't panic. > Since Kevin suggested this in his first post and I agreed with that in > the first paragraph of my first post, I think you've wasted a lot of > time here going in circles. 42 posts, more than a dozen people. I think Please tone down the hostility, Simon. I don't think talking about an issue I encountered while testing is a waste of anyone's time, it's how we improve the software. In fact, I'm hoping that potential testers are noticing the drubbing you're getting over this, because belittling anyone's bug reports is not exactly a good way to attract new testers to the project. Further, the multiple posts seem to have arrived at a minimally intrusive solution, so it seems like time well spent. > we have better things to do than this small issue, which has nothing at > all to do with a 9.0 feature. I think you should look at prioritisation. > There are many things seriously in need of fixing and this wasn't one of > them. You've made it clear, repeatedly, that you don't happen to think that new user experience is a priority for 9.0. 
However, a lot of us on this list disagree with you, and will continue to do so. Priorities are a community decision, not an individual developer one. --Josh Berkus
Josh Berkus wrote: > Thing is, if archive_command is failing, then the backup is useless > regardless until it's fixed. And sending the archives to /dev/null (the > fix you're essentially recommending above) doesn't make the backup any > more useful. That's not what I said to do first. If it's possible to fix your archive_command, and it never returned bad "I'm saying success but I didn't really do the right thing" information to the server--it just failed--this situation is completely recoverable with no damage to the backup. Just fix the archive_command, reload the configuration, and the queue of archived files will flow and eventually your consistent backup completes. This is the only behavior someone who is trying to recover from a mistake in production is likely to find acceptable, and as Simon has pointed out that is what the current situation is optimized for. Only in the situation where the archive_command was so bad that it returned the wrong data to the server--saying the segment was saved but it really wasn't--did I suggest that you might as well change archive_command to go nowhere. Because in that case, your backup is already screwed, you lost an essential piece of it. As for your comment about treating this like it's a problem specific to you, did you miss the part where I pointed out I was just expressing concerns about poor visibility into this area ("what is the archiver doing?") recently? I'm well aware this path is full of difficult to escape from holes. We just need to be careful not to do something that screws over production users in the name of reducing the learning curve. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
> That's not what I said to do first. If it's possible to fix your > archive_command, and it never returned bad "I'm saying success but I > didn't really do the right thing" information to the server--it just > failed--this situation is completely recoverable with no damage to the > backup. Just fix the archive_command, reload the configuration, and the > queue of archived files will flow and eventually your consistent backup > completes. This is the only behavior someone who is trying to recover > from a mistake in production is likely to find acceptable, and as Simon > has pointed out that is what the current situation is optimized for. Right. I'm pointing out that production and "trying out 9.0 for the first time" are actually different circumstances, and we need to be able to handle both gracefully. Since, if people have a bad experience trying it out for the first time, we'll never *get* to production. > As for your comment about treating this like it's a problem specific to > you, did you miss the part where I pointed out I was just expressing > concerns about poor visibility into this area ("what is the archiver > doing?") recently? I'm well aware this path is full of difficult to > escape from holes. We just need to be careful not to do something that > screws over production users in the name of reducing the learning curve. I think Tom's idea is minimally intrusive, and deals with the central problem, which is one of UI and visibility as you assessed. --Josh Berkus
Josh Berkus <josh@agliodbs.com> writes: > Tom, Simon, >> * emit a NOTICE as soon as pg_stop_backup's actual work is done and >> it's starting to wait for the archiver (or maybe after it's waited >> for a few seconds, but much less than the present 60). >> >> * extend the existing WARNING (and the NOTICE too if we elect to have >> one) with a HINT message explicitly saying that you can cancel the >> wait but thus-and-such consequences might ensue. >> >> Both of these things would only be helpful when using client software >> that shows you received notices promptly. psql is okay, but maybe >> pgAdmin and other tools would need some further work. There is not >> much we can do about that in the core project though. > Well, the client software could be fixed in time for 9.0, I'd think. I > think that implementing both of the above would probably do the trick > for user-friendliness, enough for 9.0. If it's obvious to the user on > the console what to do, then they won't panic.

If you like the concept, then the next question is exactly how to phrase the messages. All we have at the moment is the inside-the-delay-loop warning:

    ereport(WARNING,
            (errmsg("pg_stop_backup still waiting for archive to complete (%d seconds elapsed)",
                    waits)));

which now that I look at it could use some wordsmithing itself. Suggestions? regards, tom lane
On Feb 24, 2010, at 3:24 PM, Tom Lane wrote: > If you like the concept, then the next question is exactly how to phrase > the messages. All we have at the moment is the inside-the-delay-loop > warning: > > ereport(WARNING, > (errmsg("pg_stop_backup still waiting for archive to complete (%d seconds elapsed)", > waits))); > > which now that I look at it could use some wordsmithing itself. “Bitch, can’t you see that I’m still waiting for the archive to complete? WTF were you thinking? Jesus!” My $0.02. Best, David
> If you like the concept, then the next question is exactly how to phrase > the messages. All we have at the moment is the inside-the-delay-loop > warning: > > ereport(WARNING, > (errmsg("pg_stop_backup still waiting for archive to complete (%d seconds elapsed)", > waits)));

Well, we'll want this message first, as soon as pg_stop_backup finishes checkpointing:

WARNING: Stop backup work complete. Now awaiting completion of WAL archiving.

Then after 60s:

WARNING: pg_stop_backup is still waiting for WAL archiving to complete (%d seconds elapsed).
HINT: Check if your WAL archive_command is failing. You may abort pg_stop_backup at this point, but you will not be able to use the resulting clone.
On Wed, 2010-02-24 at 16:52 -0500, Tom Lane wrote: > Before you could enter pg_abort_backup you'd have to control-C out of > the pg_stop_backup call, and that action already accomplishes the only > thing pg_abort_backup could do for you. Agreed. I was responding to perceived user need. > So what I am thinking is that this is really just a minor bit of user > unfriendliness in pg_stop_backup. We should address it with one or > both of these changes: > > * emit a NOTICE as soon as pg_stop_backup's actual work is done and > it's starting to wait for the archiver (or maybe after it's waited > for a few seconds, but much less than the present 60). Pointless really. Nobody runs backups in production by typing pg_stop_backup() except in a demo. Nobody will see this. > * extend the existing WARNING (and the NOTICE too if we elect to have > one) with a HINT message explicitly saying that you can cancel the > wait but thus-and-such consequences might ensue. If you can see the HINT, you can also see the WARNING. If you can see the WARNING and do nothing, I don't think we need a "objects in the mirror may be closer than they appear" message. If people can't work out that if a) they are running something and b) that something is waiting that they should cancel it then we aren't going to have much luck with them. -- Simon Riggs www.2ndQuadrant.com
On Wed, 2010-02-24 at 14:20 -0800, Josh Berkus wrote: > > Since Kevin suggested this in his first post and I agreed with that in > > the first paragraph of my first post, I think you've wasted a lot of > > time here going in circles. 42 posts, more than a dozen people. I think > > Please tone down the hostility, Simon. I don't think talking about an > issue I encountered while testing is a waste of anyone's time, it's > how we improve the software. In fact, I'm hoping that potential > testers are noticing the drubbing you're getting over this, because > belittling anyone's bug reports is not exactly a good way to attract > new testers to the project. Saying "it's not a bug" doesn't belittle your bug report. Your first report was not time wasting, but talking endlessly about a subject that you've had clear replies on becomes time wasting. As I've said many times now, this isn't even a 9.0 issue. Expressing that opinion is not hostility. I'm not sure why you think *I* am receiving a drubbing? You made a mistake on a demo, filed a bug report and wouldn't listen to people telling you it's not a bug. I admire your attempts at oneupmanship. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> writes: > On Wed, 2010-02-24 at 16:52 -0500, Tom Lane wrote: >> * emit a NOTICE as soon as pg_stop_backup's actual work is done and >> it's starting to wait for the archiver (or maybe after it's waited >> for a few seconds, but much less than the present 60). > Pointless really. Nobody runs backups in production by typing > pg_stop_backup() except in a demo. Nobody will see this. I agree it's pointless in production, but this isn't about production, it's about friendliness to people who are experimenting. The case will probably never come up in production because a production installation should have a non-broken archive_command. >> * extend the existing WARNING (and the NOTICE too if we elect to have >> one) with a HINT message explicitly saying that you can cancel the >> wait but thus-and-such consequences might ensue. > If you can see the HINT, you can also see the WARNING. If you can see > the WARNING and do nothing, I don't think we need a "objects in the > mirror may be closer than they appear" message. If people can't work out > that if a) they are running something and b) that something is waiting > that they should cancel it then we aren't going to have much luck with > them. The value of the HINT I think would be to make them (a) not afraid to hit control-C and (b) aware of the fact that their archiver has got a problem. regards, tom lane
On Wed, 2010-02-24 at 23:57 +0000, Simon Riggs wrote: > > > * emit a NOTICE as soon as pg_stop_backup's actual work is done and > > it's starting to wait for the archiver (or maybe after it's waited > > for a few seconds, but much less than the present 60). > > Pointless really. Nobody runs backups in production by typing > pg_stop_backup() except in a demo. Nobody will see this. This is not true. It is not uncommon for a pitr setup to get out of sync for any number of production reasons. It is one of the reasons that PITRTools supports executing a pg_stop_backup. Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering Respect is earned, not gained through arbitrary and repetitive use or Mr. or Sir.
On Wed, 2010-02-24 at 19:08 -0500, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Wed, 2010-02-24 at 16:52 -0500, Tom Lane wrote: > >> * emit a NOTICE as soon as pg_stop_backup's actual work is done and > >> it's starting to wait for the archiver (or maybe after it's waited > >> for a few seconds, but much less than the present 60). > > > Pointless really. Nobody runs backups in production by typing > > pg_stop_backup() except in a demo. Nobody will see this. > > I agree it's pointless in production, but this isn't about production, > it's about friendliness to people who are experimenting. The case will > probably never come up in production because a production installation > should have a non-broken archive_command. No further objection. -- Simon Riggs www.2ndQuadrant.com
On Wed, 2010-02-24 at 10:07 -0800, Josh Berkus wrote: > Simon, > > Your mistake was not typoing an archive_command, it was not correctly > > testing that what you had done was actually working. The fix is to read > > the manual and correct the typo. Shutting down the server after failing > > to configure it is not likely to be a normal reaction to experiencing an > > error in configuration. > > The problem is you're thinking of an experienced PostgreSQL DBA doing > setup on a production server. That's not what I'm talking about. I'm > talking about the thousands of new users who are going to try PostgreSQL > for the first time because of HS/SR on a test installation. If they > encounter this issue, they will decide (again) that PostgreSQL is too > hard to use and give up on us for another 5 years. Shoot, forget the "new users"; I am thinking about the hundreds of thousands of existing NOT DBA users, i.e., 90% of our user base. > > Saying "RTFM and test, you newbie!" is not a valid response, and that's > what your "you should have read the docs" amounts to. Heck, I *did* > read the docs. Agreed. Although RTFM is important, we shouldn't have RTFM for something that is clearly a user visible behavior mistake on our part. > > > ISTM you should collect test reports, then analyse and prioritise them. > > This rates pretty low for me: low severity, low frequency. > > To date, I, Robert Haas, Joe Conway, Josh Drake, and the members of > LAPUG all find this highly problematic behavior. So consider it 6 > problem reports, not just one. > Basically the reports boil down to people who are actually going to be dealing with this in the field. Simon, with respect, you have been 6 feet deep in code for too long on this. You need to step back and take some constructive feedback from those that are dealing with the production issues and do so with a smile. Sincerely, Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering Respect is earned, not gained through arbitrary and repetitive use or Mr. or Sir.
Tom Lane wrote:
> The value of the HINT I think would be to make them (a) not afraid to
> hit control-C and (b) aware of the fact that their archiver has got
> a problem.

Agreed on both points. Patch attached that implements something similar to Josh's wording, tweaking the original warning too. Here's what it looks like when you run into the bad situation (which I easily simulated with "archive_command='/bin/false'") from the client's perspective:

gsmith@meddle:~/pgwork/src/master/src$ psql -c "select pg_start_backup('test')"
 pg_start_backup
-----------------
 0/5000020
(1 row)

gsmith@meddle:~/pgwork/src/master/src$ psql
psql (9.0devel)
Type "help" for help.

gsmith=# select pg_stop_backup();
NOTICE:  pg_stop_backup cleanup done, waiting for required segments to archive
WARNING:  pg_stop_backup still waiting for all required segments to archive (60 seconds elapsed)
HINT:  Confirm your archive_command is executing successfully.  pg_stop_backup can be aborted safely, but the resulting backup will not be usable.
^CCancel request sent
ERROR:  canceling statement due to user request

And this is the sort of thing that shows up in the logs with default logging behavior while all this is happening; you don't see the NOTICE, but the WARNING and HINT are both there which I think is good:

LOG:  archive command failed with exit code 1
DETAIL:  The failed archive command was: /bin/false
WARNING:  transaction log file "000000010000000000000000" could not be archived: too many failures
WARNING:  pg_stop_backup still waiting for all required segments to archive (60 seconds elapsed)
HINT:  Confirm your archive_command is executing successfully.  pg_stop_backup can be aborted safely, but the resulting backup will not be usable.

Does this solve the logging side of this? You can still make a case for a more forceful pg_stop_backup, this seems to at least remove much of the mystery and frustration from the whole exercise. This patch plus a little documentation suggesting how to recover from this issue might be enough.

-- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ca088b0..c09ede9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8125,6 +8125,9 @@ pg_stop_backup(PG_FUNCTION_ARGS)
     BackupHistoryFileName(histfilename, ThisTimeLineID, _logId, _logSeg,
                           startpoint.xrecoff % XLogSegSize);
 
+    ereport(NOTICE,
+            (errmsg("pg_stop_backup cleanup done, waiting for required segments to archive")));
+
     seconds_before_warning = 60;
     waits = 0;
 
@@ -8139,8 +8142,10 @@ pg_stop_backup(PG_FUNCTION_ARGS)
         {
             seconds_before_warning *= 2;     /* This wraps in >10 years... */
             ereport(WARNING,
-                    (errmsg("pg_stop_backup still waiting for archive to complete (%d seconds elapsed)",
-                            waits)));
+                    (errmsg("pg_stop_backup still waiting for all required segments to archive (%d seconds elapsed)",
+                            waits),
+                     errhint("Confirm your archive_command is executing successfully. "
+                             "pg_stop_backup can be aborted safely, but the resulting backup will not be usable.")));
         }
     }
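For anyone trying to picture how often the repeated WARNING shows up, here is a rough Python model of the wait loop's timing. It is a sketch, not the backend code. It assumes the loop sleeps about one second per iteration and emits a warning once the wait counter reaches `seconds_before_warning` (that condition sits outside the hunks shown in the patch), then doubles the threshold as the patch context shows. The function name `warning_times` is made up for illustration.

```python
def warning_times(total_seconds, first_warning=60):
    """Return the elapsed-second marks at which a WARNING would be emitted.

    Models the doubling of seconds_before_warning visible in the patch context.
    Assumptions: one loop iteration per second, and a warning fires when the
    wait counter reaches the current threshold.
    """
    seconds_before_warning = first_warning
    fired = []
    for waits in range(1, total_seconds + 1):
        if waits >= seconds_before_warning:
            seconds_before_warning *= 2
            fired.append(waits)
    return fired


# Warning marks during a wait of a little over eight minutes.
print(warning_times(500))
```

Under those assumptions the warnings thin out quickly (each one arrives after twice the previous total wait), which is why an immediate message before the loop matters for interactive users.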
On 2/24/10 5:36 PM, Greg Smith wrote: > gsmith=# select pg_stop_backup(); > NOTICE: pg_stop_backup cleanup done, waiting for required segments to > archive > WARNING: pg_stop_backup still waiting for all required segments to > archive (60 seconds elapsed) > HINT: Confirm your archive_command is executing successfully. > pg_stop_backup can be aborted safely, but the resulting backup will not > be usable. > ^CCancel request sent > ERROR: canceling statement due to user request This looks really good, thanks! > Does this solve the logging side of this? You can still make a case for > a more forceful pg_stop_backup, this seems to at least remove much of > the mystery and frustration from the whole exercise. This patch plus a > little documentation suggesting how to recover from this issue might be > enough. Yeah, the concern is user-friendliness. As Simon points out, allowing pg_stop_backup to abort would have other unexpected-results issues. --Josh Berkus
Josh Berkus <josh@agliodbs.com> writes: > On 2/24/10 5:36 PM, Greg Smith wrote: >> gsmith=# select pg_stop_backup(); >> NOTICE: pg_stop_backup cleanup done, waiting for required segments to >> archive >> WARNING: pg_stop_backup still waiting for all required segments to >> archive (60 seconds elapsed) >> HINT: Confirm your archive_command is executing successfully. >> pg_stop_backup can be aborted safely, but the resulting backup will not >> be usable. >> ^CCancel request sent >> ERROR: canceling statement due to user request > This looks really good, thanks! The one thing I'm undecided about is whether we want the immediate NOTICE, as opposed to dialing down the time till the first WARNING to something like 5 or 10 seconds. I think the main argument for the latter approach would be to avoid log-spam in normal operation. Although Greg is correct that a NOTICE wouldn't be logged at default log levels, lots of people don't use that default. Comments? regards, tom lane
On Wed, Feb 24, 2010 at 08:52:28PM -0500, Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: > > On 2/24/10 5:36 PM, Greg Smith wrote: > >> gsmith=# select pg_stop_backup(); > >> NOTICE: pg_stop_backup cleanup done, waiting for required segments to > >> archive > >> WARNING: pg_stop_backup still waiting for all required segments to > >> archive (60 seconds elapsed) > >> HINT: Confirm your archive_command is executing successfully. > >> pg_stop_backup can be aborted safely, but the resulting backup will not > >> be usable. > >> ^CCancel request sent > >> ERROR: canceling statement due to user request > > > This looks really good, thanks! > > The one thing I'm undecided about is whether we want the immediate > NOTICE, as opposed to dialing down the time till the first WARNING > to something like 5 or 10 seconds. I think the main argument for > the latter approach would be to avoid log-spam in normal operation. > Although Greg is correct that a NOTICE wouldn't be logged at default > log levels, lots of people don't use that default. Comments? As I see it, the clarity concern trumps the log spam one. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Tom Lane wrote: > The one thing I'm undecided about is whether we want the immediate > NOTICE, as opposed to dialing down the time till the first WARNING > to something like 5 or 10 seconds. I think the main argument for the > latter approach would be to avoid log-spam in normal operation I thought about that for a minute, but didn't think pg_stop_backup is a common enough operation that anyone will complain that it's a little more verbose in its logging now. I know when I was new to this, I used to wonder just what it was busy doing just after executing this command when it hung there for a while sometimes, and would have welcomed this extra bit of detail--preferably immediately, not even after a 5 or 10 second delay. -- Greg Smith
On 2/24/10 5:58 PM, Greg Smith wrote: > > I thought about that for a minute, but didn't think pg_stop_backup is a > common enough operation that anyone will complain that it's a little > more verbose in its logging now. I know when I was new to this, I used > to wonder just what it was busy doing just after executing this command > when it hung there for a while sometimes, and would have welcomed this > extra bit of detail--preferably immediately, not even after a 5 or 10 > second delay. +1 --Josh
On Thu, Feb 25, 2010 at 10:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The one thing I'm undecided about is whether we want the immediate > NOTICE, as opposed to dialing down the time till the first WARNING > to something like 5 or 10 seconds. I think the main argument for the > latter approach would be to avoid log-spam in normal operation. > Although Greg is correct that a NOTICE wouldn't be logged at default > log levels, lots of people don't use that default. Comments? I don't want that immediate NOTICE message, which sounds like a noise. Delaying it or changing the log level to DEBUG work for me. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
> I don't want that immediate NOTICE message, which sounds like a noise. > Delaying it or changing the log level to DEBUG work for me. Problem is that a new user won't be seeing DEBUG messages by default. This issue is all about new user experience. Alternatively, we could move the time of the first "waiting for archive" message up, but that seems misleading. --Josh Berkus
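The NOTICE-versus-DEBUG question comes down to two minimum-severity thresholds. Here is a simplified Python sketch of that filtering. It assumes the stock defaults (client_min_messages = notice, log_min_messages = warning) and collapses the several DEBUG levels into one; the helper name `is_emitted` is hypothetical, and the real server has more levels than are listed here.

```python
# Severity order, lowest to highest, restricted to the levels discussed here.
# Simplification: the server's DEBUG1..DEBUG5 are folded into a single DEBUG.
SEVERITY = ["DEBUG", "NOTICE", "WARNING", "ERROR"]

# Assumed stock defaults for the two thresholds.
CLIENT_MIN_MESSAGES = "NOTICE"
LOG_MIN_MESSAGES = "WARNING"


def is_emitted(level, threshold):
    """True if a message at `level` passes a minimum-severity `threshold`."""
    return SEVERITY.index(level) >= SEVERITY.index(threshold)


for level in ("DEBUG", "NOTICE", "WARNING"):
    print(level,
          "client:", is_emitted(level, CLIENT_MIN_MESSAGES),
          "log:", is_emitted(level, LOG_MIN_MESSAGES))
```

With those defaults a NOTICE reaches an interactive psql user without landing in the server log, while a DEBUG message reaches neither. That is the trade-off being argued over.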
Greg Smith <greg@2ndquadrant.com> writes:
> Tom Lane wrote:
>> The value of the HINT I think would be to make them (a) not afraid to
>> hit control-C and (b) aware of the fact that their archiver has got
>> a problem.
>
> Agreed on both points. Patch attached that implements something similar
> to Josh's wording, tweaking the original warning too.

OK, everyone likes the immediate NOTICE. I did a bit of copy-editing and committed the attached version.

regards, tom lane

Index: xlog.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.377
diff -c -r1.377 xlog.c
*** xlog.c  19 Feb 2010 10:51:03 -0000  1.377
--- xlog.c  25 Feb 2010 02:15:49 -0000
***************
*** 8132,8138 ****
       *
       * We wait forever, since archive_command is supposed to work and we
       * assume the admin wanted his backup to work completely.  If you don't
!      * wish to wait, you can set statement_timeout.
       */
      XLByteToPrevSeg(stoppoint, _logId, _logSeg);
      XLogFileName(lastxlogfilename, ThisTimeLineID, _logId, _logSeg);
--- 8132,8139 ----
       *
       * We wait forever, since archive_command is supposed to work and we
       * assume the admin wanted his backup to work completely.  If you don't
!      * wish to wait, you can set statement_timeout.  Also, some notices
!      * are issued to clue in anyone who might be doing this interactively.
       */
      XLByteToPrevSeg(stoppoint, _logId, _logSeg);
      XLogFileName(lastxlogfilename, ThisTimeLineID, _logId, _logSeg);
***************
*** 8141,8146 ****
--- 8142,8150 ----
      BackupHistoryFileName(histfilename, ThisTimeLineID, _logId, _logSeg,
                            startpoint.xrecoff % XLogSegSize);
  
+     ereport(NOTICE,
+             (errmsg("pg_stop_backup cleanup done, waiting for required WAL segments to be archived")));
+ 
      seconds_before_warning = 60;
      waits = 0;
  
***************
*** 8155,8162 ****
          {
              seconds_before_warning *= 2;     /* This wraps in >10 years... */
              ereport(WARNING,
!                     (errmsg("pg_stop_backup still waiting for archive to complete (%d seconds elapsed)",
!                             waits)));
          }
      }
  
--- 8159,8169 ----
          {
              seconds_before_warning *= 2;     /* This wraps in >10 years... */
              ereport(WARNING,
!                     (errmsg("pg_stop_backup still waiting for all required WAL segments to be archived (%d seconds elapsed)",
!                             waits),
!                      errhint("Check that your archive_command is executing properly.  "
!                              "pg_stop_backup can be cancelled safely, "
!                              "but the database backup will not be usable without all the WAL segments.")));
          }
      }
On Wed, 2010-02-24 at 19:02 +0000, Simon Riggs wrote: > On Wed, 2010-02-24 at 10:17 -0800, Joshua D. Drake wrote: > You make the mistake of assuming that someone that can develop has no > solution experience. That is exactly how I fund further development, so > you are off base by a long way. I never implied that. I implied that your perspective is currently skewed. I stand by that implication. Joshua D. Drake
On Wed, 2010-02-24 at 12:32 -0800, Josh Berkus wrote: > > pg_stop_backup() doesn't complete until all the WAL segments needed to > > restore from the backup are archived. If archive_command is failing, > > that never happens. > > OK, so we need a way out of that cycle if the user is issuing > pg_stop_backup because they *already know* that archive_command is > failing. Right now, there's no way out other than a fast shutdown, > which is a bit user-hostile. Hmmm well... changing the archive_command to /bin/true and issuing a HUP would cause the command to succeed, but I still think that is over the top. I prefer Kevin's solution or some variant thereof: http://archives.postgresql.org/pgsql-hackers/2010-02/msg01853.php http://archives.postgresql.org/pgsql-hackers/2010-02/msg01907.php Sincerely, Joshua D. Drake
On Wed, 2010-02-24 at 23:57 +0000, Simon Riggs wrote: > > > * emit a NOTICE as soon as pg_stop_backup's actual work is done and > > it's starting to wait for the archiver (or maybe after it's waited > > for a few seconds, but much less than the present 60). > > Pointless really. Nobody runs backups in production by typing > pg_stop_backup() except in a demo. Nobody will see this. This is not true. It is not uncommon for a pitr setup to get out of sync for any number of production reasons. It is one of the reasons that PITRTools supports executing a pg_stop_backup. Joshua D. Drake
On Wed, Feb 24, 2010 at 11:14 PM, Josh Berkus <josh@agliodbs.com> wrote: > > Right. I'm pointing out that production and "trying out 9.0 for the > first time" are actually different circumstances, and we need to be able > to handle both gracefully. Since, if people have a bad experience > trying it out for the first time, we'll never *get* to production. Fwiw if it's not clear what's going on when you're trying out something carefully for the first time it's 10x worse if you're stuck in a situation like this when you have people breathing down your neck yelling about how they're losing money for every second you're down. In an ideal world it would be best if pg_stop_backup could actually print the error status of the archiving command. Is there any way for it to get ahold of the fact that the archiving is failing? And do we have closure on whether a "fast" shutdown is hanging? Or was that actually a smart shutdown? Perhaps "smart" shutdown needs to print out what it's waiting on periodically as well, and suggest a fast shutdown to abort those transactions. -- greg
Heikki Linnakangas wrote: > Josh Berkus wrote: > > OK, can you go through the reasons why pg_stop_backup would not > > complete? > > pg_stop_backup() doesn't complete until all the WAL segments needed to > restore from the backup are archived. If archive_command is failing, > that never happens. Yes, very old behavior allowed people to think they had a full backup when the WAL files needed were not all archived, which was a bad thing. Thankfully no one reported catastrophic failure from the old behavior. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do + If your life is a hard drive, Christ can be your backup. +
> In an ideal world it would be best if pg_stop_backup could actually > print the error status of the archiving command. Agreed. > And do we have closure on whether a "fast" shutdown is hanging? Or was > that actually a smart shutdown? No, I need to retest and verify 100% that the issue wasn't something other than stop_backup. > Perhaps "smart" shutdown needs to print out what it's waiting on > periodically as well, and suggest a fast shutdown to abort those > transactions. That would be a good thing to have for PostgreSQL in general. Given that any number of things can stop a smart shutdown, it's more than a little baffling to users why one hangs forever. BUT ... since most users run smart shutdown via a services script, output on what shutdown is waiting on would need to be written to the log rather than given interactively. --Josh Berkus
Greg Stark wrote:
> In an ideal world it would be best if pg_stop_backup could actually
> print the error status of the archiving command. Is there any way for
> it to get ahold of the fact that the archiving is failing?

This is the area I mentioned earlier, where I had proposed a patch. The archiver doesn't tell anyone anything about what it's doing right now, or even save its state information. I recently made a proposal for making the segment it's currently working on (or just finished, or both) visible:

http://archives.postgresql.org/message-id/4B4FEA18.5080705@2ndquadrant.com

The main content for that was tracking disk space, which wandered into a separate discussion, but it would be easy enough to take the information that patch intends to export ("what archive file is currently being processed?") and print it in the error message too. That makes it easy for people to infer the command is failing if the same segment number shows up every time in that message.

I didn't finish that only because the CF kicked off and I switched out of new development to review. Since this class of error keeps popping up, I could easily finish that patch off by next week and see if it helps here. I thought it was a long overdue bit of monitoring to add to the database anyway, just never had the time to work on it before.

> And do we have closure on whether a "fast" shutdown is hanging? Or was
> that actually a smart shutdown?

When I tested this myself, a smart shutdown hung every time, while a fast one blew right through the problem--matching what's described in the manual. Josh suggested at one point he might have seen a situation where fast shutdown wasn't sufficient to work around this and an immediate one was required. It's certainly possible that happened for an as yet unknown reason--I've seen plenty of situations where fast shutdown didn't work--but I haven't been able to replicate it.

-- Greg Smith
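The "same segment number shows up every time" heuristic can be written down in a few lines. This Python sketch is purely illustrative: the function name `looks_stalled` and the idea of sampling the archiver's current segment at intervals are assumptions, since no such monitoring interface exists in the thread as posted.

```python
def looks_stalled(observed_segments, repeats=3):
    """Guess that archiving is stuck if the last `repeats` samples name the same segment.

    `observed_segments` is a list of segment names sampled over time,
    oldest first. This is a heuristic only: a slow but healthy archiver
    could also sit on one segment for several samples.
    """
    if len(observed_segments) < repeats:
        return False
    tail = observed_segments[-repeats:]
    return all(segment == tail[0] for segment in tail)


samples = ["000000010000000000000000"] * 3
print(looks_stalled(samples))
```

The `repeats` threshold is arbitrary. In practice it would need to be tuned against how long a healthy archive_command takes on the system in question.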
Joshua D. Drake wrote: > On Wed, 2010-02-24 at 12:32 -0800, Josh Berkus wrote: > > > pg_stop_backup() doesn't complete until all the WAL segments needed to > > > restore from the backup are archived. If archive_command is failing, > > > that never happens. > > > > OK, so we need a way out of that cycle if the user is issuing > > pg_stop_backup because they *already know* that archive_command is > > failing. Right now, there's no way out other than a fast shutdown, > > which is a bit user-hostile. > > Hmmm well... changing the archive_command to /bin/true and issuing a HUP > would cause the command to succeed, but I still think that is over the > top. I prefer Kevin's solution or some variant thereof: > > http://archives.postgresql.org/pgsql-hackers/2010-02/msg01853.php > http://archives.postgresql.org/pgsql-hackers/2010-02/msg01907.php Postgres 9.0 will be the first release to mention /bin/true as a way of turning off archiving in extraordinary circumstances: http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html -- Bruce Momjian <bruce@momjian.us> http://momjian.us
Looks like we arrived at the best solution here. I don't think it was clear to users that pg_stop_backup() was issuing an archive_command and hence they wouldn't be likely to understand the delay or correct a problem. This gives them the information they need at the time they need it.

---------------------------------------------------------------------------

Tom Lane wrote:
> Greg Smith <greg@2ndquadrant.com> writes:
> > Tom Lane wrote:
> >> The value of the HINT I think would be to make them (a) not afraid to
> >> hit control-C and (b) aware of the fact that their archiver has got
> >> a problem.
> >
> > Agreed on both points. Patch attached that implements something similar
> > to Josh's wording, tweaking the original warning too.
>
> OK, everyone likes the immediate NOTICE. I did a bit of copy-editing
> and committed the attached version.
>
> regards, tom lane

-- Bruce Momjian <bruce@momjian.us> http://momjian.us
--On February 24, 2010 16:01:02 -0500 Tom Lane <tgl@sss.pgh.pa.us> wrote: > One objection to this is that it's not very clear to the user when > pg_stop_backup has finished with actual work and is just waiting for the > archiver, ie when is it safe to hit control-C? Maybe we should emit a > "backup done, waiting for archiver to complete" notice before entering > the sleep loop. +1 for this. This hint would certainly help to recognize the issue immediately (or at least point to a possible cause). -- Thanks Bernd
On Fri, Feb 26, 2010 at 9:41 AM, Bernd Helmle <mailings@oopsware.de> wrote: > > > --On February 24, 2010 16:01:02 -0500 Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> One objection to this is that it's not very clear to the user when >> pg_stop_backup has finished with actual work and is just waiting for the >> archiver, ie when is it safe to hit control-C? Maybe we should emit a >> "backup done, waiting for archiver to complete" notice before entering >> the sleep loop. > > +1 for this. This hint would certainly help to recognize the issue > immediately (or at least point to a possible cause). So looking at the code we *do* print something in pg_stop_backup(). We just wait 60s before doing so. I propose we shorten that to 10s. Secondarily, the message printed at this time and when the process is finished doesn't actually give the user any information on how much longer to expect the process to take. It would be nice to say what the target archive log we're waiting on is and then periodically print out what the last archived log file was. Or perhaps just do the arithmetic and periodically print how many megabytes of log files remain to be archived. -- greg
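The "do the arithmetic" idea could look something like the Python sketch below. It is only an illustration under stated assumptions: WAL file names are 24 hex digits (timeline, log id, segment id, eight digits each), segments are the default 16 MB, and each log id holds 255 segments as in releases of this era. That last figure should be verified for the version in use. Both function names are hypothetical.

```python
SEGMENT_MB = 16          # assumed default WAL segment size
SEGMENTS_PER_LOG = 255   # assumption for releases of this era; verify for yours


def segment_index(wal_name, segments_per_log=SEGMENTS_PER_LOG):
    """Map a 24-hex-digit WAL file name to a linear segment counter."""
    if len(wal_name) != 24:
        raise ValueError("expected a 24-character WAL file name")
    log_id = int(wal_name[8:16], 16)
    seg_id = int(wal_name[16:24], 16)
    return log_id * segments_per_log + seg_id


def megabytes_remaining(last_archived, target):
    """Megabytes of WAL still to archive, from just after last_archived through target."""
    if last_archived[:8] != target[:8]:
        raise ValueError("segments are on different timelines")
    remaining = segment_index(target) - segment_index(last_archived)
    return max(remaining, 0) * SEGMENT_MB


print(megabytes_remaining("000000010000000000000002",
                          "000000010000000000000005"))
```

A periodic message built on a figure like this would tell the user both that progress is being made and roughly how much is left, which the fixed "still waiting" text cannot do.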
On Fri, Feb 26, 2010 at 2:47 AM, Bruce Momjian <bruce@momjian.us> wrote: > Postgres 9.0 will be the first release to mention /bin/true as a way of > turning off archiving in extraordinary circumstances: > > http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html > Setting archive_mode to a command that does nothing but return true, e.g. /bin/true, "return true" seems ambiguous for me. How about writing clearly "return a zero exit status" instead? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
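To make "return a zero exit status" concrete, here is a small Python sketch of the success test involved. It is a simplification: the real archive_command is a shell command string run by the server, while this example runs an argument list, and `archive_command_succeeded` is a made-up helper name. Tiny Python one-liners stand in for /bin/true and /bin/false so the example does not depend on those binaries being present.

```python
import subprocess
import sys


def archive_command_succeeded(argv):
    """Run a stand-in archive command; a zero exit status counts as success."""
    return subprocess.run(argv).returncode == 0


# Stand-ins for /bin/true and /bin/false that work wherever Python runs.
always_ok = [sys.executable, "-c", "raise SystemExit(0)"]
always_fail = [sys.executable, "-c", "raise SystemExit(1)"]

print(archive_command_succeeded(always_ok))
print(archive_command_succeeded(always_fail))
```

The point of the wording change is visible here: nothing is "returned" in the boolean sense. The command's exit status is compared against zero, and any other value is treated as a failed archive attempt.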
On Fri, Feb 26, 2010 at 10:00 PM, Greg Stark <gsstark@mit.edu> wrote: > Secondarily, the message printed at this time and when the process is > finished doesn't actually give the user any information on how much > longer to expect the process to take. > > It would be nice to say what the target archive log we're waiting on > is and then periodically print out what the last archived log file > was. +1 We would be easily able to calculate the last archived log file from the existence of archive status files. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao wrote:
> On Fri, Feb 26, 2010 at 2:47 AM, Bruce Momjian <bruce@momjian.us> wrote:
>> Postgres 9.0 will be the first release to mention /bin/true as a way of
>> turning off archiving in extraordinary circumstances:
>>
>> http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html
>
>> Setting archive_mode to a command that does nothing but return true, e.g. /bin/true,
>
> "return true" seems ambiguous for me. How about writing clearly
> "return a zero exit status" instead?

This is a good catch, and I have a work-in-progress update to that doc section that fixes that wording, as well as rearranging the recent additions a bit. Really that whole "/bin/true" bit needs to go after the example. A very brief intro to what "exit status" means on various platforms might be in order too. I'm adjusting all that to read better; once I'm happy with it I'll submit a doc patch in the next week or two with the final result.

-- Greg Smith
Fujii Masao wrote: > We would be easily able to calculate the last archived log file from > the existence of archive status files. > Right, but you have to actually scan the whole archive directory to figure that out, and I'd rather not see that code get duplicated somewhere else when it's already inside the archive_command logic. If it just shared that info with the rest of the system instead this would be trivial to discover. -- Greg Smith
Greg Smith wrote: > Fujii Masao wrote: >> We would be easily able to calculate the last archived log file from >> the existence of archive status files. > > Right, but you have to actually scan the whole archive directory to > figure that out, and I'd rather not see that code get duplicated > somewhere else when it's already inside the archive_command logic. If > it just shared that info with the rest of the system instead this would > be trivial to discover. The archiver process is not connected to shared memory, so scanning the directory is the way to do it. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
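A directory scan of the kind being discussed might look like this Python sketch. It assumes the status directory holds marker files named after each segment with a `.done` suffix once archived (and `.ready` while pending), and that all segments of interest are on one timeline so names sort in archive order. The function name and regular expression are illustrative and are not taken from the server source.

```python
import os
import re

# Plain WAL segment markers only; backup history markers have longer names.
WAL_DONE = re.compile(r"^([0-9A-F]{24})\.done$")


def last_archived_segment(status_dir):
    """Return the highest-named segment with a .done marker, or None if there is none."""
    done = []
    for entry in os.listdir(status_dir):
        match = WAL_DONE.match(entry)
        if match:
            done.append(match.group(1))
    return max(done) if done else None
```

Note that this only sees markers that still exist. Status files for recycled segments disappear over time, so the result is "the newest archived segment still recorded", which is enough for a progress message but not a full history.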
On Tue, 2 Mar 2010 15:20:36 +0900, Fujii Masao <masao.fujii@gmail.com> wrote: >> Setting archive_mode to a command that does nothing but return true, >> e.g. /bin/true, > > "return true" seems ambiguous for me. How about writing clearly > "return a zero exit status" instead? For the record, I hate the fact that I ever mentioned this, and I think it is a terrible hack that we would mention it in the docs. From a professional perspective, I cringe at the idea of telling a customer to do this, not to mention it won't work on w32. Joshua D. Drake -- PostgreSQL - XMPP: jdrake(at)jabber(dot)postgresql(dot)org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
On Tue, 2010-03-02 at 15:20 +0900, Fujii Masao wrote: > On Fri, Feb 26, 2010 at 2:47 AM, Bruce Momjian <bruce@momjian.us> wrote: > > Postgres 9.0 will be the first release to mention /bin/true as a way of > > turning off archiving in extraordinary circumstances: > > > > http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html > > > > Setting archive_mode to a command that does nothing but return true, e.g. /bin/true, > > "return true" seems ambiguous for me. How about writing clearly > "return a zero exit status" instead? Docs are already quite clear on that point. I think we should avoid specifying it twice. -- Simon Riggs www.2ndQuadrant.com
On Tue, Mar 2, 2010 at 9:48 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> > Setting archive_mode to a command that does nothing but return true, e.g. /bin/true, >> >> "return true" seems ambiguous for me. How about writing clearly >> "return a zero exit status" instead? > > Docs are already quite clear on that point. I think we should avoid > specifying it twice. > Why do we disallow turning off archive_mode anyways? I understand not turning it on -- though even that would be nice if it "took effect after the next checkpoint" but turning it off should always be safe, no? -- greg
On Tue, 2010-03-02 at 13:13 +0000, Greg Stark wrote: > On Tue, Mar 2, 2010 at 9:48 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > >> > Setting archive_mode to a command that does nothing but return true, e.g. /bin/true, > >> > >> "return true" seems ambiguous for me. How about writing clearly > >> "return a zero exit status" instead? > > > > Docs are already quite clear on that point. I think we should avoid > > specifying it twice. > > > > Why do we disallow turning off archive_mode anyways? Because it is needed for safety and nobody has got around to coding the idea of turning it on/off during normal running, which is possible, with appropriate care. > I understand not > turning it on -- though even that would be nice if it "took effect > after the next checkpoint" but turning it off should always be safe, > no? We don't support that behaviour in parameters. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > On Tue, 2010-03-02 at 13:13 +0000, Greg Stark wrote: > >> Why do we disallow turning off archive_mode anyways? >> > > Because it is needed for safety and nobody has got around to coding the > idea of turning it on/off during normal running, which is possible, with > appropriate care. > It's actually made it pretty high up on the list of desired features for some of the replication projects: http://wiki.postgresql.org/wiki/ClusterFeatures#Start.2Fstop_archiving_at_runtime Since that is one of the easier items on that list to actually knock off (probably an order of magnitude more so than the average feature there), it's completely feasible somebody will do so for 9.1. -- Greg Smith