Thread: pg_stop_backup does not complete

pg_stop_backup does not complete

From
Josh Berkus
Date:
Simon, Fujii, All:

While demoing HS/SR at SCALE, I ran into a problem which is likely to be
a commonly encountered bug when people first set up HS/SR.  Here's the
sequence:

1) Set up a brand new master with an archive_command and archive_mode=on.

2) Start the master

3) Do a pg_start_backup()

4) Realize, based on log error messages, that I've misconfigured the
archive_command.

5) Attempt to shut down the master.  Master tells me that pg_stop_backup
must be run in order to shut down.

6) Execute pg_stop_backup.

7) pg_stop_backup waits forever without ever stopping backup.  Every 60
seconds, it gives me a helpful "still waiting" message, but at least in
the amount of time I was willing to wait (5 minutes), it never completed.

8) Do an immediate shutdown, as it's the only way I can get the database
unstuck.

With some experimentation, the problem seems to occur when you have a
failing archive_command and a master which currently has no database
traffic; for example, if I did some database write activity (a createdb)
then pg_stop_backup would complete after about 60 seconds (which, btw,
is extremely annoying, but at least tolerable).

This issue is 100% reproducible.

--Josh Berkus
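
The waiting behaviour in step 7 can be pictured with a toy model. This is
only a sketch, not PostgreSQL source: the function name wait_for_archive,
the one-second simulated polling interval, and the give_up_after limit are
all assumptions made for illustration. The report above describes no
timeout in pg_stop_backup itself; the limit here stands in for the 5
minutes the reporter was willing to wait.

```python
from typing import Callable, List


def wait_for_archive(is_archived: Callable[[], bool],
                     warn_every: int = 60,
                     give_up_after: int = 300) -> List[str]:
    """Toy model of a wait loop that polls once per simulated second.

    is_archived   -- hypothetical check that the last WAL segment was archived
    warn_every    -- seconds between "still waiting" messages
    give_up_after -- hypothetical patience of the person watching (seconds)
    """
    log: List[str] = []
    for elapsed in range(1, give_up_after + 1):
        if is_archived():
            log.append(f"archived after {elapsed}s")
            return log
        if elapsed % warn_every == 0:
            log.append(f"still waiting ({elapsed}s elapsed)")
    log.append("gave up")
    return log


if __name__ == "__main__":
    # A failing archive_command is modelled as a check that never succeeds.
    for line in wait_for_archive(lambda: False):
        print(line)
```

With a check that never succeeds, the model logs five "still waiting"
lines and then gives up, which matches the 5 minutes described above.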


Re: pg_stop_backup does not complete

From
Josh Berkus
Date:
> This issue is 100% reproducible.

Oh, btw, this is on Alpha4.

--Josh Berkus


Re: pg_stop_backup does not complete

From
"Joshua D. Drake"
Date:
On Tue, 2010-02-23 at 09:45 -0800, Josh Berkus wrote:
> Simon, Fujii, All:
>
> While demoing HS/SR at SCALE, I ran into a problem which is likely to be
> a commonly encountered bug when people first setup HS/SR.  Here's the
> sequence:
>
> 1) Set up a brand new master with an archive_command and archive_mode=on.
>
> 2) Start the master
>
> 3) Do a pg_start_backup()
>
> 4) Realize, based on log error messages, that I've misconfigured the
> archive_command.
>
> 5) Attempt to shut down the master.  Master tells me that pg_stop_backup
> must be run in order to shut down.

If I issue a shutdown, PostgreSQL should do whatever it needs to do to
shut down, including issuing a pg_stop_backup.

Joshua D. Drake




--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
Respect is earned, not gained through arbitrary and repetitive use or Mr. or Sir.

Re: pg_stop_backup does not complete

From
"Kevin Grittner"
Date:
"Joshua D. Drake" <jd@commandprompt.com> wrote:
> If I issue a shutdown, PostgreSQL should do whatever it needs to
> do to shutdown; including issuing a pg_stop_backup. 
Should we have a pg_fail_backup function, so that it doesn't put out
a file which suggests that we have a complete backup?
-Kevin


Re: pg_stop_backup does not complete

From
Simon Riggs
Date:
On Tue, 2010-02-23 at 09:45 -0800, Josh Berkus wrote:

> 1) Set up a brand new master with an archive_command and archive_mode=on.
> 
> 2) Start the master
> 
> 3) Do a pg_start_backup()
> 
> 4) Realize, based on log error messages, that I've misconfigured the
> archive_command.

> 5) Attempt to shut down the master.  Master tells me that pg_stop_backup
> must be run in order to shut down.
> 
> 6) Execute pg_stop_backup.
> 
> 7) pg_stop_backup waits forever without ever stopping backup.  Every 60
> seconds, it gives me a helpful "still waiting" message, but at least in
> the amount of time I was willing to wait (5 minutes), it never completed.
> 
> 8) do an immediate shutdown, as it's the only way I can get the database
> unstuck.
> 
> With some experimentation, the problem seems to occur when you have a
> failing archive_command and a master which currently has no database
> traffic; for example, if I did some database write activity (a createdb)
> then pg_stop_backup would complete after about 60 seconds (which, btw,
> is extremely annoying, but at least tolerable).
> 
> This issue is 100% reproducible.

IMHO there is no problem in that behaviour. If somebody requests a
backup then we should wait for it to complete. Kevin's suggestion of
pg_fail_backup() is the only sensible conclusion there because it gives
an explicit way out of deadlock.

ISTM the problem is that you didn't test. Steps 3 and 4 should have been
reversed. Perhaps we should put something in the docs to say "and test".
The correct resolution is to put in an archive_command that works.

We can put in an extra step to prevent a pg_start_backup() if there are
a significant number of outstanding files to be archived. Doing that
seems like closing the door after the horse has bolted, since we just
introduced streaming replication that doesn't rely on archived files. In
any case, I don't see many people working on a production system hitting
a problem on an archive_command and then deciding to shut down. 

So I don't see this as something that needs fixing for 9.0. There is
already too much non-essential code there, all of which needs to be
tested. I don't think adding in new corner cases to "help" people makes
any sense until we have automated testing that allows us to rerun the
regression tests to check all this stuff still works.

-- Simon Riggs           www.2ndQuadrant.com
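
One way to act on the "and test" advice is to run the archive_command
template by hand before calling pg_start_backup(). The following is a
minimal sketch, assuming the documented placeholders (%p for the path of
the file to archive, %f for its file name) and the rule that a zero exit
status means success. The helper name archive_command_succeeds is
hypothetical, and escaped percent signs (%%) are not handled.

```python
import subprocess


def archive_command_succeeds(template: str, path: str, filename: str) -> bool:
    """Run an archive_command-style template once and report success.

    %p is replaced by the path of the file to archive and %f by its
    file name. A zero exit status counts as success. Escaped percent
    signs (%%) are not handled in this sketch.
    """
    command = template.replace("%p", path).replace("%f", filename)
    # The template is run through the shell, as a command string would be.
    return subprocess.run(command, shell=True).returncode == 0
```

For example, archive_command_succeeds("cp %p /mnt/archive/%f",
"/tmp/testfile", "testfile") would return False if /mnt/archive (a
made-up path) does not exist, which is the kind of typo discussed above.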



Re: pg_stop_backup does not complete

From
"Joshua D. Drake"
Date:
On Tue, 2010-02-23 at 18:58 +0000, Simon Riggs wrote:
> On Tue, 2010-02-23 at 09:45 -0800, Josh Berkus wrote:

> > This issue is 100% reproducible.
>
> IMHO there is no problem in that behaviour. If somebody requests a
> backup then we should wait for it to complete. Kevin's suggestion of
> pg_fail_backup() is the only sensible conclusion there because it gives
> an explicit way out of deadlock.
>
> ISTM the problem is that you didn't test. Steps 3 and 4 should have been
> reversed. Perhaps we should put something in the docs to say "and test".
> The correct resolution is to put in an archive_command that works.

The problem isn't that it is a bad archive_command, it is that
PostgreSQL has no way to deal with this gracefully. Yes people should
test but are we dealing with the real world or not?

>
> So I don't see this as something that needs fixing for 9.0. There is
> already too much non-essential code there, all of which needs to be
> tested. I don't think adding in new corner cases to "help" people makes
> any sense until we have automated testing that allows us to rerun the
> regression tests to check all this stuff still works.

This will bite us if we release like this.

Joshua D. Drake




Re: pg_stop_backup does not complete

From
"Kevin Grittner"
Date:
Simon Riggs <simon@2ndQuadrant.com> wrote:
> The correct resolution is to put in an archive_command that works.
One really should ensure that WAL files (or should I now say data?
;-) are flowing before running the pg_start_backup()
function.  The documentation has always been pretty explicit about
that:
http://www.postgresql.org/docs/8.4/interactive/continuous-archiving.html
| 24.3.2. Making a Base Backup
| 
| The procedure for making a base backup is relatively simple:
| 
| 1. Ensure that WAL archiving is enabled and working.
| 
| 2. Connect to the database as a superuser, and issue the command:
| 
| SELECT pg_start_backup('label');
| ...
As long as the SR documentation is equally explicit on this point,
you'd have to be blatantly going against the instructions to hit
this.
Which makes me think that while pg_fail_backup() might actually be a
good idea, it's not really needed to solve this, so it's 9.1
material at best.
-Kevin
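
A rough way to check that WAL files "are flowing" is to look for segments
still waiting on the archiver. This is a sketch under stated assumptions:
in releases of this era the archiver's status files live in
pg_xlog/archive_status under the data directory, and a file ending in
.ready marks a segment that has not been archived yet. The function name
pending_segments is hypothetical.

```python
from pathlib import Path
from typing import List


def pending_segments(archive_status_dir: str) -> List[str]:
    """Return the names of WAL files still waiting to be archived.

    A "<name>.ready" status file is taken to mean that <name> has not
    yet been archived; "<name>.done" files are ignored.
    """
    status_dir = Path(archive_status_dir)
    # Path.stem drops the final ".ready" suffix and keeps the rest.
    return sorted(p.stem for p in status_dir.glob("*.ready"))
```

If the returned list keeps growing after some write activity, the
archive_command is probably failing and is worth fixing before
pg_start_backup() is issued.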


Re: pg_stop_backup does not complete

From
David Fetter
Date:
On Tue, Feb 23, 2010 at 06:58:22PM +0000, Simon Riggs wrote:
> On Tue, 2010-02-23 at 09:45 -0800, Josh Berkus wrote:
> 
> > 1) Set up a brand new master with an archive_command and
> > archive_mode=on.
> > 
> > 2) Start the master
> > 
> > 3) Do a pg_start_backup()
> > 
> > 4) Realize, based on log error messages, that I've misconfigured
> > the archive_command.
> 
> > 5) Attempt to shut down the master.  Master tells me that
> > pg_stop_backup must be run in order to shut down.
> > 
> > 6) Execute pg_stop_backup.
> > 
> > 7) pg_stop_backup waits forever without ever stopping backup.
> > Every 60 seconds, it gives me a helpful "still waiting" message, but
> > at least in the amount of time I was willing to wait (5 minutes),
> > it never completed.
> > 
> > 8) do an immediate shutdown, as it's the only way I can get the
> > database unstuck.
> > 
> > With some experimentation, the problem seems to occur when you
> > have a failing archive_command and a master which currently has no
> > database traffic; for example, if I did some database write
> > activity (a createdb) then pg_stop_backup would complete after
> > about 60 seconds (which, btw, is extremely annoying, but at least
> > tolerable).
> > 
> > This issue is 100% reproducible.
> 
> IMHO there is no problem in that behaviour. If somebody requests a
> backup then we should wait for it to complete. Kevin's suggestion of
> pg_fail_backup() is the only sensible conclusion there because it
> gives an explicit way out of deadlock.
> 
> ISTM the problem is that you didn't test. Steps 3 and 4 should have
> been reversed. Perhaps we should put something in the docs to say
> "and test".  The correct resolution is to put in an archive_command
> that works.

+1 for clarifying and extending the docs.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: pg_stop_backup does not complete

From
Simon Riggs
Date:
On Tue, 2010-02-23 at 11:24 -0800, Joshua D. Drake wrote:

> This will bite us if we release like this.

No it won't. The current behaviour was put there by user request a few
releases back. This isn't a 9.0 issue, and as I've said, it's addressing
something that we no longer see as mainstream going forward.

There are plenty of things that will bite us, but not this.

-- Simon Riggs           www.2ndQuadrant.com



Re: pg_stop_backup does not complete

From
Robert Haas
Date:
On Tue, Feb 23, 2010 at 12:52 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> On Tue, 2010-02-23 at 09:45 -0800, Josh Berkus wrote:
>> Simon, Fujii, All:
>>
>> While demoing HS/SR at SCALE, I ran into a problem which is likely to be
>> a commonly encountered bug when people first setup HS/SR.  Here's the
>> sequence:
>>
>> 1) Set up a brand new master with an archive_command and archive_mode=on.
>>
>> 2) Start the master
>>
>> 3) Do a pg_start_backup()
>>
>> 4) Realize, based on log error messages, that I've misconfigured the
>> archive_command.
>>
>> 5) Attempt to shut down the master.  Master tells me that pg_stop_backup
>> must be run in order to shut down.
>
> If I issue a shutdown, PostgreSQL should do whatever it needs to do to
> shutdown; including issuing a pg_stop_backup.

Maybe.  But for sure, if it doesn't, and instead tells the user to
issue pg_stop_backup(), then pg_stop_backup() had better WORK when the
user tries to execute it.  I gather that the problem is that it has to
finish all that outstanding archiving before shutting down, in which
case Kevin's suggestion of having a command to abort the backup seems
reasonable.  I might call it pg_abort_backup() rather than
pg_fail_backup(), but...

...Robert


Re: pg_stop_backup does not complete

From
"Joshua D. Drake"
Date:
On Tue, 2010-02-23 at 14:49 -0500, Robert Haas wrote:

> > If I issue a shutdown, PostgreSQL should do whatever it needs to do to
> > shutdown; including issuing a pg_stop_backup.
>
> Maybe.  But for sure, if it doesn't, and instead tells the user to
> issue pg_stop_backup(), then pg_stop_backup() had better WORK when the
> user tries to execute it.

Right.

>   I gather that the problem is that it has to
> finish all that outstanding archiving before shutting down, in which
> case Kevin's suggestion of having a command to abort the backup seems
> reasonable.  I might call it pg_abort_backup() rather than
> pg_fail_backup(), but...
>

But...?

Joshua D. Drake


> ...Robert
>



Re: pg_stop_backup does not complete

From
Robert Haas
Date:
On Tue, Feb 23, 2010 at 3:09 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> On Tue, 2010-02-23 at 14:49 -0500, Robert Haas wrote:
>
>> > If I issue a shutdown, PostgreSQL should do whatever it needs to do to
>> > shutdown; including issuing a pg_stop_backup.
>>
>> Maybe.  But for sure, if it doesn't, and instead tells the user to
>> issue pg_stop_backup(), then pg_stop_backup() had better WORK when the
>> user tries to execute it.
>
> Right.
>
>>   I gather that the problem is that it has to
>> finish all that outstanding archiving before shutting down, in which
>> case Kevin's suggestion of having a command to abort the backup seems
>> reasonable.  I might call it pg_abort_backup() rather than
>> pg_fail_backup(), but...
>>
>
> But...?

But it seems like a good idea other than that.

...Robert


Re: pg_stop_backup does not complete

From
Fujii Masao
Date:
On Wed, Feb 24, 2010 at 4:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Maybe.  But for sure, if it doesn't, and instead tells the user to
> issue pg_stop_backup(), then pg_stop_backup() had better WORK when the
> user tries to execute it.  I gather that the problem is that it has to
> finish all that outstanding archiving before shutting down, in which
> case Kevin's suggestion of having a command to abort the backup seems
> reasonable.  I might call it pg_abort_backup() rather than
> pg_fail_backup(), but...

Or how about adding a new boolean parameter to pg_stop_backup() that
specifies whether to wait for WAL archiving?
   pg_stop_backup([wait boolean])

This parameter is optional. If true (default), it waits for archiving.

In warm-standby and SR, we don't need to wait for archiving before starting
the standby from the base backup. So pg_stop_backup(false) would be
useful for speeding up the setup of log-shipping.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: pg_stop_backup does not complete

From
Josh Berkus
Date:
On 2/23/10 10:58 AM, Simon Riggs wrote:
> So I don't see this as something that needs fixing for 9.0. There is
> already too much non-essential code there, all of which needs to be
> tested. I don't think adding in new corner cases to "help" people makes
> any sense until we have automated testing that allows us to rerun the
> regression tests to check all this stuff still works.

So, you're going to personally field the roughly 10,000 bug reports we
get on pgsql-general about this behaviour?  24/7?

The fact that we ran into this issue on the *first* day of testing the
new alpha4 is indicative of how common it will be -- it is not a corner
case, it is a common setup error which will affect probably 20% of new
users who try 9.0.  And new users are going to panic when they can't
shut down postgresql, not just test for issues.

Any situation where postgresql cannot be safely shut down because of a
common setup mistake (typoing an archive_command) is, IMNSHO, not
something we can release with.

--Josh Berkus


Re: pg_stop_backup does not complete

From
Simon Riggs
Date:
On Tue, 2010-02-23 at 17:46 -0800, Josh Berkus wrote:
> On 2/23/10 10:58 AM, Simon Riggs wrote:
> > So I don't see this as something that needs fixing for 9.0. There is
> > already too much non-essential code there, all of which needs to be
> > tested. I don't think adding in new corner cases to "help" people makes
> > any sense until we have automated testing that allows us to rerun the
> > regression tests to check all this stuff still works.
> 
> So, you're going to personally field the roughly 10,000 bug reports we
> get on pgsql-general about this behaviour?  24/7?

> The fact that we ran into this issue on the *first* day of testing the
> new alpha4 is indicative of how common it will be -- it is not a corner
> case, it is a common setup error which will affect probably 20% of new
> users who try 9.0.  And new users are going to panic when they can't
> shut down postgresql, not just test for issues.
> 
> Any situation where postgresql cannot be safely shut down because of a
> common setup mistake (typoing an archive_command) is, IMNSHO, not
> something we can release with.

It's not a common setup mistake. Nothing changed in this release and
this has never been reported before.

The behaviour to wait for pg_stop_backup() was added by user request.
The behaviour for shutdown to wait for pg_stop_backup() was also added
by user request.

Your mistake was not typoing an archive_command, it was not correctly
testing that what you had done was actually working. The fix is to read
the manual and correct the typo. Shutting down the server after failing
to configure it is not likely to be a normal reaction to experiencing an
error in configuration. Better docs might help you, but I doubt it.

ISTM you should collect test reports, then analyse and prioritise them.
This rates pretty low for me: low severity, low frequency.

(If new users panic when they can't shut down the server, they
probably won't like smart shutdown very much either.)

-- Simon Riggs           www.2ndQuadrant.com



Re: pg_stop_backup does not complete

From
Josh Berkus
Date:
Simon,

> It's not a common setup mistake. Nothing changed in this release and
> this has never been reported before.
> 
> The behaviour to wait for pg_stop_backup() was added by user request.
> The behaviour for shutdown to wait for pg_stop_backup() was also added
> by user request.

Your two statements above contradict each other.

And, while it makes sense for smart shutdown to wait for
pg_stop_backup(), it does not make sense for fast shutdown to wait.

Aside from that, the main issue is not having shutdown wait for
pg_stop_backup; it's pg_stop_backup never completing.  An issue, I'll
note, you're ignoring.  If you're going to be this defensive whenever
anyone reports a bug, it's going to be veeeeeeery slow going to
troubleshoot HS.

As Robert Haas said: "But for sure, if it doesn't, and instead tells the
user to issue pg_stop_backup(), then pg_stop_backup() had better WORK
when the user tries to execute it."

> Your mistake was not typoing an archive_command, it was not correctly
> testing that what you had done was actually working. The fix is to read
> the manual and correct the typo. Shutting down the server after failing
> to configure it is not likely to be a normal reaction to experiencing an
> error in configuration.

The problem is you're thinking of an experienced PostgreSQL DBA doing
setup on a production server.  That's not what I'm talking about.  I'm
talking about the thousands of new users who are going to try PostgreSQL
for the first time because of HS/SR on a test installation.  If they
encounter this issue, they will decide (again) that PostgreSQL is too
hard to use and give up on us for another 5 years.

We've spent the last few years overcoming the image of PostgreSQL being
too complicated for most people to use.  You seem hell-bent on restoring
it. Given the timing, our project has one chance to establish a new
reputation as the SQL database for everybody.   User-hostile behavior
like this will ruin that chance.

Saying "RTFM and test, you newbie!" is not a valid response, and that's
what your "you should have read the docs" amounts to.  Heck, I *did*
read the docs.

> ISTM you should collect test reports, then analyse and prioritise them.
> This rates pretty low for me: low severity, low frequency.

To date, I, Robert Haas, Joe Conway, Josh Drake, and the members of
LAPUG all find this highly problematic behavior.  So consider it 6
problem reports, not just one.

--Josh Berkus


Re: pg_stop_backup does not complete

From
"Joshua D. Drake"
Date:
On Wed, 2010-02-24 at 10:07 -0800, Josh Berkus wrote:
> Simon,

> > Your mistake was not typoing an archive_command, it was not correctly
> > testing that what you had done was actually working. The fix is to read
> > the manual and correct the typo. Shutting down the server after failing
> > to configure it is not likely to be a normal reaction to experiencing an
> > error in configuration.
>
> The problem is you're thinking of an experienced PostgreSQL DBA doing
> setup on a production server.  That's not what I'm talking about.  I'm
> talking about the thousands of new users who are going to try PostgreSQL
> for the first time because of HS/SR on a test installation.  If they
> encounter this issue, they will decide (again) that PostgreSQL is too
> hard to use and give up on us for another 5 years.

Shoot, forget the "new users"; I am thinking about the hundreds of
thousands of existing non-DBA users, i.e. 90% of our user base.


>
> Saying "RTFM and test, you newbie!" is not a valid response, and that's
> what your "you should have read the docs" amounts to.  Heck, I *did*
> read the docs.

Agreed. Although RTFM is important, we shouldn't answer with RTFM for
something that is clearly a user-visible behavior mistake on our part.

>
> > ISTM you should collect test reports, then analyse and prioritise them.
> > This rates pretty low for me: low severity, low frequency.
>
> To date, I, Robert Haas, Joe Conway, Josh Drake, and the members of
> LAPUG all find this highly problematic behavior.  So consider it 6
> problem reports, not just one.
>

Basically the reports boil down to people who are actually going to be
dealing with this in the field. Simon, with respect, you have been 6 feet
deep in code for too long on this. You need to step back and take some
constructive feedback from those that are dealing with the production
issues and do so with a smile.

Sincerely,

Joshua D. Drake






Re: pg_stop_backup does not complete

From
Robert Haas
Date:
On Wed, Feb 24, 2010 at 1:07 PM, Josh Berkus <josh@agliodbs.com> wrote:
> And, while it makes sense for smart shutdown to wait for
> pg_stop_backup(), it does not make sense for fast shutdown to wait.

TFM in fact says:

http://www.postgresql.org/docs/8.4/static/app-pg-ctl.html#APP-PG-CTL-DESCRIPTION

In stop mode, the server that is running in the specified data
directory is shut down. Three different shutdown methods can be
selected with the -m option: "Smart" mode waits for online backup mode
to finish and all the clients to disconnect. This is the default.
"Fast" mode does not wait for clients to disconnect and will terminate
an online backup in progress. All active transactions are rolled back
and clients are forcibly disconnected, then the server is shut down.
"Immediate" mode will abort all server processes without a clean
shutdown. This will lead to a recovery run on restart.

Your OP was not too clear about whether you tried a smart shutdown or
a fast shutdown, but if you meant a fast shutdown, this is apparently
(he says without testing) a regression.

...Robert
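
The three shutdown modes quoted above can be summarised in a small lookup
table. This is only a sketch for reference: the dictionary and the helper
name pg_ctl_stop_command are hypothetical, the descriptions paraphrase the
8.4 pg_ctl documentation quoted in this message, and the data directory in
the usage note is a made-up path.

```python
from typing import Dict, List

# Paraphrased from the pg_ctl documentation quoted above.
SHUTDOWN_MODES: Dict[str, str] = {
    "smart": "waits for online backup mode to finish and all clients to disconnect",
    "fast": "terminates an online backup in progress and disconnects clients",
    "immediate": "aborts all server processes; recovery runs on restart",
}


def pg_ctl_stop_command(data_dir: str, mode: str = "smart") -> List[str]:
    """Build the argument list for 'pg_ctl stop' with the given -m mode."""
    if mode not in SHUTDOWN_MODES:
        raise ValueError(f"unknown shutdown mode: {mode!r}")
    return ["pg_ctl", "stop", "-D", data_dir, "-m", mode]
```

For instance, pg_ctl_stop_command("/var/lib/pgsql/data", "fast") builds
the command for the fast shutdown discussed here, which per the quoted
documentation should not wait for the backup to finish.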


Re: pg_stop_backup does not complete

From
Josh Berkus
Date:
> Your OP was not too clear about whether you tried a smart shutdown or
> a fast shutdown, but if you meant a fast shutdown, this is apparently
> (he says without testing) a regression.

Ah, sorry.  Yes, I attempted a fast shutdown.

--Josh Berkus


Re: pg_stop_backup does not complete

From
Heikki Linnakangas
Date:
Josh Berkus wrote:
> And, while it makes sense for smart shutdown to wait for
> pg_stop_backup(), it does not make sense for fast shutdown to wait.

Hang on, fast shutdown does *not* wait for backup to finish.

> Aside from that, the main issue is not having shutdown wait for
> pg_stop_backup; it's pg_stop_backup never completing.  An issue, I'll
> note, you're ignoring.

Ahh, that's a detail I missed too.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: pg_stop_backup does not complete

From
Simon Riggs
Date:
On Wed, 2010-02-24 at 10:07 -0800, Josh Berkus wrote:
> > 
> > The behaviour to wait for pg_stop_backup() was added by user request.
> > The behaviour for shutdown to wait for pg_stop_backup() was also added
> > by user request.
> 
> Your two statements above contradict each other.

No they don't.

> And, while it makes sense for smart shutdown to wait for
> pg_stop_backup(), it does not make sense for fast shutdown to wait.
> 
> Aside from that, the main issue is not having shutdown wait for
> pg_stop_backup; it's pg_stop_backup never completing.  An issue, I'll
> note, you're ignoring.  If you're going to be this defensive whenever
> anyone reports a bug, it's going to be veeeeeeery slow going to
> troubleshoot HS.

I haven't ignored the issue. The behaviour you are complaining about was
put there following complaints that it didn't wait. You're ignoring the
point that there hasn't been any change in this release and so your
comments are unfounded in reality.

> To date, I, Robert Haas, Joe Conway, Josh Drake, and the members of
> LAPUG all find this highly problematic behavior.  So consider it 6
> problem reports, not just one.

:-)  I'm told that ignoring user groups is OK...

If you're going to address single issues rather than prioritise what is
important over what is not, you will get strange responses. 

-- Simon Riggs           www.2ndQuadrant.com



Re: pg_stop_backup does not complete

From
Simon Riggs
Date:
On Wed, 2010-02-24 at 10:17 -0800, Joshua D. Drake wrote:

> Basically the reports boil down to people who are actually going to be
> dealing with this in the field. Simon with respect you have been 6 feet
> deep in code for too long on this. You need to step back and take some
> constructive feedback from those that are dealing with the production
> issues and do so with a smile.

I receive constructive feedback all the time from the many users I deal
personally and directly with each week.

You make the mistake of assuming that someone that can develop has no
solution experience. That is exactly how I fund further development, so
you are off base by a long way.

The way this works currently is based on production feedback. This post
is about non-production usage. Until someone comes up with a truly
constructive suggestion that takes account of the issues that cause the
current design, it won't get traction with me.

-- Simon Riggs           www.2ndQuadrant.com



Re: pg_stop_backup does not complete

From
Josh Berkus
Date:
On 2/24/10 10:40 AM, Heikki Linnakangas wrote:
> Josh Berkus wrote:
>> And, while it makes sense for smart shutdown to wait for
>> pg_stop_backup(), it does not make sense for fast shutdown to wait.
> 
> Hang on, fast shutdown does *not* wait for backup to finish.

It did when I tried it.  I'll test to see what combination of factors
produces that.

>> Aside from that, the main issue is not having shutdown wait for
>> pg_stop_backup; it's pg_stop_backup never completing.  An issue, I'll
>> note, you're ignoring.
> 
> Ahh, that's a detail I missed too.

Yeah, that's the important one.  I went through the sequence:

1) Try to shut down.

2) be told to run pg_stop_backup()

3) run pg_stop_backup()

4) pg_stop_backup never completes.

Look at the original bug report on this thread; it has the details.  I
think it's still the issue that if no logs are being written (database
is idle) pg_stop_backup does not complete, which I thought we fixed, but
maybe not?

--Josh Berkus



Re: pg_stop_backup does not complete

From
Josh Berkus
Date:
> I haven't ignored the issue. The behaviour you are complaining about was
> put there following complaints that it didn't wait. You're ignoring the
> point that there hasn't been any change in this release and so your
> comments are unfounded in reality.

I've posted a reproducible bug (pg_stop_backup never terminating).
Either say that you tried to reproduce it and failed, or accept that it
exists.  Saying "that bug is impossible" is the denial of reality.

To reiterate yet again, the problem is that pg_stop_backup never
completes.  What we do on shutdown is a side issue.

--Josh Berkus



Re: pg_stop_backup does not complete

From
Simon Riggs
Date:
On Wed, 2010-02-24 at 11:07 -0800, Josh Berkus wrote:
> > I haven't ignored the issue. The behaviour you are complaining about was
> > put there following complaints that it didn't wait. You're ignoring the
> > point that there hasn't been any change in this release and so your
> > comments are unfounded in reality.
> 
> I've posted a reproducible bug (pg_stop_backup never terminating).
> Either say that you tried to reproduce it and failed, or accept that it
> exists.  Saying "that bug is impossible" is the denial of reality.

You haven't posted a reproducible bug, nor is this new to 9.0.

You have just noticed a production feature that was specifically put
there by user request. The feature exists, has done for some time now
and it's acting as it should.

This is about what happens in production, not your laptop. The required
behaviour in-production is to assume that the sysadmin has configured it
correctly and we wait for them to fix the problem. The previous
complaints were from people who felt they wanted to avoid invalid
backups.

Personally, I'd say there were many issues that are new to 9.0 that
really are important, and that this isn't one of them.

-- Simon Riggs           www.2ndQuadrant.com



Re: pg_stop_backup does not complete

From
Josh Berkus
Date:
> You haven't posted a reproducible bug, nor is this new to 9.0.

Yes, I have.

1) set up a failing archive_command on an idle database

2) do pg_start_backup

3) do pg_stop_backup

4) pg_stop_backup waits forever (or at least 5 minutes, which is as long
as I've given it so far).

> This is about what happens in production, not your laptop. The required
> behaviour in-production is to assume that the sysadmin has configured it
> correctly and we wait for them to fix the problem. 

90% of our user base does not have a sysadmin.  Or, for that matter,
even a professional DBA.

> The previous
> complaints were from people who felt they wanted to avoid invalid
> backups.

People don't deploy PostgreSQL in production in the first place if it
has this kind of "no good option from here" failure when they first try
it.  HS/SR is for use by new users of PostgreSQL as well as the
experienced.

--Josh Berkus



Re: pg_stop_backup does not complete

From
"Joshua D. Drake"
Date:
On Wed, 2010-02-24 at 19:02 +0000, Simon Riggs wrote:
> On Wed, 2010-02-24 at 10:17 -0800, Joshua D. Drake wrote:

> You make the mistake of assuming that someone that can develop has no
> solution experience. That is exactly how I fund further development, so
> you are off base by a long way.

I never implied that. I implied that your perspective is currently
skewed. I stand by that implication.

Joshua D. Drake



--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
Respect is earned, not gained through arbitrary and repetitive use of Mr. or Sir.

Re: pg_stop_backup does not complete

From
Simon Riggs
Date:
On Wed, 2010-02-24 at 11:31 -0800, Josh Berkus wrote:

> > This is about what happens in production, not your laptop. The required
> > behaviour in-production is to assume that the sysadmin has configured it
> > correctly and we wait for them to fix the problem. 
> 
> 90% of our user base does not have a sysadmin.  Or, for that matter,
> even a professional DBA.

Your logic is terrible. If there is no sysadmin, who would be typing the
pg_stop_backup() ? Who would have misconfigured it in the first place?

If you have a concrete proposal, get off your soapbox and make one,
based upon the technical information you've received. There are clear
reasons why things are the way they are and those reasons will not be
ignored, by me.

-- Simon Riggs           www.2ndQuadrant.com



Re: pg_stop_backup does not complete

From
Josh Berkus
Date:
Simon,

> If you have a concrete proposal, get off your soapbox and make one,
> based upon the technical information you've received. There are clear
> reasons why things are the way they are and those reasons will not be
> ignored, by me.

OK, can you go through the reasons why pg_stop_backup would not
complete?  And why it's a problem to have it complete?  I'll admit to
not understanding them; it seems to me that pg_stop_backup should just
immediately force a checkpoint and a log write, but you're obviously
trying to prevent something with the current behavior.  What are you
trying to prevent?

--Josh Berkus



Re: pg_stop_backup does not complete

From
Josh Berkus
Date:
On 2/24/10 11:55 AM, Joshua D. Drake wrote:
> On Wed, 2010-02-24 at 19:02 +0000, Simon Riggs wrote:
>> On Wed, 2010-02-24 at 10:17 -0800, Joshua D. Drake wrote:
> 
>> You make the mistake of assuming that someone that can develop has no
>> solution experience. That is exactly how I fund further development, so
>> you are off base by a long way.
> 
> I never implied that. I implied that your perspective is currently
> skewed. I stand by that implication. 

Let's kill the ad-hominem attacks guys.  Not productive.  Thanks.

--Josh Berkus



Re: pg_stop_backup does not complete

From
Heikki Linnakangas
Date:
Josh Berkus wrote:
> OK, can you go through the reasons why pg_stop_backup would not
> complete?  

pg_stop_backup() doesn't complete until all the WAL segments needed to
restore from the backup are archived. If archive_command is failing,
that never happens.

> And why it's a problem to have it complete? 

Because then you would conclude that the backup is finished and you have
all the data you need to restore safely in the archive. If
archive_command is failing, that's not happening.
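
The rule described above can be modelled in a few lines of standalone C. This is only a sketch of the stated condition, not PostgreSQL's actual code; the `Segment` struct and `backup_wait_done` function are hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one WAL segment that the backup needs. */
typedef struct {
    const char *name;   /* segment file name */
    bool archived;      /* true once archive_command has succeeded for it */
} Segment;

/*
 * The wait can end only when every required segment has been archived.
 * With a failing archive_command no segment ever flips to archived,
 * so this keeps returning false.
 */
bool backup_wait_done(const Segment *segs, size_t n)
{
    assert(segs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!segs[i].archived)
            return false;
    }
    return true;
}
```

In this model, returning early would report a finished backup while part of the WAL needed to restore it is still missing from the archive.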

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: pg_stop_backup does not complete

From
Josh Berkus
Date:
> pg_stop_backup() doesn't complete until all the WAL segments needed to
> restore from the backup are archived. If archive_command is failing,
> that never happens.

OK, so we need a way out of that cycle if the user is issuing
pg_stop_backup because they *already know* that archive_command is
failing.  Right now, there's no way out other than a fast shutdown,
which is a bit user-hostile.

--Josh Berkus


Re: pg_stop_backup does not complete

From
"Joshua D. Drake"
Date:
On Wed, 2010-02-24 at 12:32 -0800, Josh Berkus wrote:
> > pg_stop_backup() doesn't complete until all the WAL segments needed to
> > restore from the backup are archived. If archive_command is failing,
> > that never happens.
>
> OK, so we need a way out of that cycle if the user is issuing
> pg_stop_backup because they *already know* that archive_command is
> failing.  Right now, there's no way out other than a fast shutdown,
> which is a bit user-hostile.

Hmmm well... changing the archive_command to /bin/true and issuing a HUP
would cause the command to succeed, but I still think that is over the
top. I prefer Kevin's solution or some variant thereof:

http://archives.postgresql.org/pgsql-hackers/2010-02/msg01853.php
http://archives.postgresql.org/pgsql-hackers/2010-02/msg01907.php


Sincerely,

Joshua D. Drake




Re: pg_stop_backup does not complete

From
Heikki Linnakangas
Date:
Josh Berkus wrote:
>> pg_stop_backup() doesn't complete until all the WAL segments needed to
>> restore from the backup are archived. If archive_command is failing,
>> that never happens.
> 
> OK, so we need a way out of that cycle if the user is issuing
> pg_stop_backup because they *already know* that archive_command is
> failing.  Right now, there's no way out other than a fast shutdown,

Sure there is. Just kill the session, Ctrl-c or similar.
pg_stop_backup() isn't actually doing anything at that point anymore;
it's just waiting for the files to be archived before returning.

Or fix archive_command, and pg_reload_conf().

BTW, if you want a timeout for that, you can use statement_timeout.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: pg_stop_backup does not complete

From
"Kevin Grittner"
Date:
Josh Berkus <josh@agliodbs.com> wrote:
>> pg_stop_backup() doesn't complete until all the WAL segments
>> needed to restore from the backup are archived. If
>> archive_command is failing, that never happens.
> 
> OK, so we need a way out of that cycle if the user is issuing
> pg_stop_backup because they *already know* that archive_command is
> failing.  Right now, there's no way out other than a fast
> shutdown, which is a bit user-hostile.

So maybe pg_abort_backup() is needed for 9.0 after all?

(1)  You'd want to be able to run it either instead of
pg_stop_backup or to interrupt a pending one.

(2)  You wouldn't want the .backup file to be written.

(3)  What about the equivalent WAL end-of-backup record?

-Kevin


Re: pg_stop_backup does not complete

From
Simon Riggs
Date:
On Wed, 2010-02-24 at 11:55 -0800, Joshua D. Drake wrote:
> On Wed, 2010-02-24 at 19:02 +0000, Simon Riggs wrote:
> > On Wed, 2010-02-24 at 10:17 -0800, Joshua D. Drake wrote:
> 
> > You make the mistake of assuming that someone that can develop has no
> > solution experience. That is exactly how I fund further development, so
> > you are off base by a long way.
> 
> I never implied that. I implied that your perspective is currently
> skewed. I stand by that implication. 

My perspective comes from knowing the code AND having production
experience with PostgreSQL many times over.

-- Simon Riggs           www.2ndQuadrant.com



Re: pg_stop_backup does not complete

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
>> pg_stop_backup() doesn't complete until all the WAL segments needed to
>> restore from the backup are archived. If archive_command is failing,
>> that never happens.

> OK, so we need a way out of that cycle if the user is issuing
> pg_stop_backup because they *already know* that archive_command is
> failing.  Right now, there's no way out other than a fast shutdown,
> which is a bit user-hostile.

The pg_abort_backup() operation previously proposed seems like the only
workable compromise.  Simon is quite right to not want pg_stop_backup()
to behave in a way that could contribute to data loss; but on the other
hand there needs to be some clear way to get the system out of that
state at need.

        regards, tom lane


Re: pg_stop_backup does not complete

From
"David E. Wheeler"
Date:
On Feb 24, 2010, at 12:47 PM, Tom Lane wrote:

>> OK, so we need a way out of that cycle if the user is issuing
>> pg_stop_backup because they *already know* that archive_command is
>> failing.  Right now, there's no way out other than a fast shutdown,
>> which is a bit user-hostile.
> 
> The pg_abort_backup() operation previously proposed seems like the only
> workable compromise.  Simon is quite right to not want pg_stop_backup()
> to behave in a way that could contribute to data loss; but on the other
> hand there needs to be some clear way to get the system out of that
> state at need.

+1 makes sense.

David



Re: pg_stop_backup does not complete

From
Greg Smith
Date:
Josh Berkus wrote:
>> pg_stop_backup() doesn't complete until all the WAL segments needed to
>> restore from the backup are archived. If archive_command is failing,
>> that never happens.
>>     
>
> OK, so we need a way out of that cycle if the user is issuing
> pg_stop_backup because they *already know* that archive_command is
> failing.  Right now, there's no way out other than a fast shutdown,
> which is a bit user-hostile.
>   

gsmith=# select name,context from pg_settings where name like 'archive%';
      name       |  context
-----------------+------------
 archive_command | sighup
 archive_mode    | postmaster
 archive_timeout | sighup

I expect for your particular bad situation, you can replace the 
archive_command with a corrected one, use "pg_ctl reload" to send a 
SIGHUP to make that fix active, and escape from this.  That's the only 
right way out of this situation.  You can't just abort a backup someone 
has asked for just because archives are failing and allow the server to 
shutdown cleanly in this situation.  That's the wrong thing to do for 
production setups; the last thing you want for a system with archiving 
issues is to be stopped normally if it's interfering with an explicit 
admin requested backup.

Not necessarily any reason that backup even needs to fail, and no reason 
for the server to get restarted in this situation at all.  If the 
archive_command never returned false information, and in fact just 
returned a valid error code, all of the segments needed to make the 
backup consistent will be queued up waiting for the problem to be 
fixed.  Put the fixed archive_command in place, and you're off and 
running again.  If that's impossible, because the archive_command was 
really screwed up, we can just tell people to swap to an archive_command 
that just returns success, and let the queued-up segments waiting to be
archived all get tossed away.  That backup will be bad; they fix the
archive_command, send SIGHUP, and start over with a new backup.
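
To illustrate what "returned a valid error code" means in practice, here is a rough C sketch of an archive helper that reports success only when the copy really succeeded. It is an assumption-laden example, not a recommended production archiver: `archive_segment` and its return codes are hypothetical, and the existence check is not race-free.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical archive helper: copy src to dst and return 0 only if the
 * whole copy succeeded. Any non-zero return means "failed", so the
 * caller knows the segment still has to be retried later.
 */
int archive_segment(const char *src, const char *dst)
{
    assert(src != NULL && dst != NULL);

    /* Refuse to overwrite a file that is already in the archive. */
    FILE *existing = fopen(dst, "rb");
    if (existing != NULL) {
        fclose(existing);
        return 1;
    }

    FILE *in = fopen(src, "rb");
    if (in == NULL)
        return 2;

    FILE *out = fopen(dst, "wb");
    if (out == NULL) {
        fclose(in);
        return 3;
    }

    char buf[8192];
    size_t n;
    int status = 0;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {
            status = 4;
            break;
        }
    }
    if (ferror(in))
        status = 4;
    fclose(in);

    /* fclose flushes buffered data; a failure here means the copy is incomplete. */
    if (fclose(out) != 0)
        status = 4;

    /* Never leave a partial file behind that could be mistaken for a good one. */
    if (status != 0)
        remove(dst);
    return status;
}
```

A helper that printed an error but still returned 0 would be the "saying success but didn't really do the right thing" case: the server would consider the segment archived and the backup would silently lose it.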

There's some doc patches that could guide how to handle this situation 
better for sure, but I don't see any code changes needed.  Everything 
working as designed, optimized for production use at the expense of some 
confusion on how to recover if you configure things badly.

I suggested a patch a few weeks ago to make "what is the archiver 
doing?" behavior easier to monitor, got the impression people felt it 
was redundant given SR was the preferred path moving forward and 
eventually this whole archive_command bit would be going away.  I could 
revive that work if you feel this is such a bad issue that we need a 
better way to watch what the archiver is doing.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us



Re: pg_stop_backup does not complete

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Josh Berkus wrote:
>> OK, so we need a way out of that cycle if the user is issuing
>> pg_stop_backup because they *already know* that archive_command is
>> failing.  Right now, there's no way out other than a fast shutdown,

> Sure there is. Just kill the session, Ctrl-c or similar.
> pg_stop_backup() isn't actually doing anything at that point anymore;
> it's just waiting for the files to be archived before returning.

One objection to this is that it's not very clear to the user when
pg_stop_backup has finished with actual work and is just waiting for the
archiver, ie when is it safe to hit control-C?  Maybe we should emit a
"backup done, waiting for archiver to complete" notice before entering
the sleep loop.

        regards, tom lane


Re: pg_stop_backup does not complete

From
Josh Berkus
Date:
Greg,

> I expect for your particular bad situation, you can replace the
> archive_command with a corrected one, use "pg_ctl reload" to send a
> SIGHUP to make that fix active, and escape from this.  That's the only
> right way out of this situation.  You can't just abort a backup someone
> has asked for just because archives are failing and allow the server to
> shutdown cleanly in this situation.  That's the wrong thing to do for
> production setups; the last thing you want for a system with archiving
> issues is to be stopped normally if it's interfering with an explicit
> admin requested backup.

Yeah, I can see that for large production setups with multiple staff.
We also need something newbie-friendly (and friendly to the large number
of users we have where the DBA/Sysadmin is just the most skilled web
developer) though.  The above procedure is far too complex for someone
who is "just trying out" PostgreSQL as a replacement for MySQL, and if
recent conferences are anything to go by, we're about to have several
thousand such users.

BTW, please stop treating this issue as something which happens "only to
Josh".  I wouldn't be raising it if it weren't a natural circumstance
which anyone who is trying PostgreSQL with HS/SR for the first time,
with no experience with Warm Standby, would get into.  Such new users
are *likely* to get archive_command wrong, and likely to want to start
over when they do.  If we make that painful for them, they'll just
switch to MySQL or CouchDB instead.

Thing is, if archive_command is failing, then the backup is useless
regardless until it's fixed.  And sending the archives to /dev/null (the
fix you're essentially recommending above) doesn't make the backup any
more useful.  So I'm seeing pg_abort_backup(), which also produces
markers which prevent the backup from loading, as an improvement on
current UI.

--Josh Berkus


Re: pg_stop_backup does not complete

From
Heikki Linnakangas
Date:
Josh Berkus wrote:
> So I'm seeing pg_abort_backup(), which also produces
> markers which prevent the backup from loading, as an improvement on
> current UI.

Starting with 9.0, if recovery doesn't see an end-of-backup record, it
will refuse to start up. In earlier versions we had a similar mechanism
using the backup history files.
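
As a toy model of that safeguard, the check amounts to scanning the replayed WAL for an end-of-backup record. The sketch below is a simplification under stated assumptions; `RecordKind` and `backup_is_consistent` are hypothetical names and do not correspond to PostgreSQL's real WAL record types or recovery code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified classification of WAL records. */
typedef enum { REC_ORDINARY, REC_BACKUP_END } RecordKind;

/*
 * A restored base backup is usable only if the WAL available for
 * replay contains an end-of-backup record. If archiving failed before
 * that record reached the archive, this returns false.
 */
bool backup_is_consistent(const RecordKind *recs, size_t n)
{
    assert(recs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (recs[i] == REC_BACKUP_END)
            return true;
    }
    return false;
}
```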

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: pg_stop_backup does not complete

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> Thing is, if archive_command is failing, then the backup is useless
> regardless until it's fixed.  And sending the archives to /dev/null (the
> fix you're essentially recommending above) doesn't make the backup any
> more useful.  So I'm seeing pg_abort_backup(), which also produces
> markers which prevent the backup from loading, as an improvement on
> current UI.

On reflection I'm not sure what pg_abort_backup would do for you.
As Heikki points out, by the time the user has realized that
pg_stop_backup() is not completing, it's *already done* all of the
state changes it's going to make.  There is no way to take the
backup-complete WAL entry out of the WAL stream; it's already in there
and there's probably ordinary entries after it by now.  Having an
oh-the-backup-failed-after-all entry somewhere downstream of that is
entirely useless; the more so because by the time anything could *see*
such an entry, the problem would have been resolved, since the problem
is exactly not having gotten the WAL stream out to the archive.

Before you could enter pg_abort_backup you'd have to control-C out of
the pg_stop_backup call, and that action already accomplishes the only
thing pg_abort_backup could do for you.

So what I am thinking is that this is really just a minor bit of user
unfriendliness in pg_stop_backup.  We should address it with one or
both of these changes:

* emit a NOTICE as soon as pg_stop_backup's actual work is done and
it's starting to wait for the archiver (or maybe after it's waited
for a few seconds, but much less than the present 60).

* extend the existing WARNING (and the NOTICE too if we elect to have
one) with a HINT message explicitly saying that you can cancel the
wait but thus-and-such consequences might ensue.

Both of these things would only be helpful when using client software
that shows you received notices promptly.  psql is okay, but maybe
pgAdmin and other tools would need some further work.  There is not
much we can do about that in the core project though.
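
The two proposals above can be sketched as a small standalone C model of the wait loop's messaging. This is not the backend code: `message_for_wait`, `format_wait_message`, the `notice_after` and `warn_every` parameters, and the NOTICE and HINT wording are all assumptions made for illustration. Only the WARNING text is taken from the existing message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef enum { MSG_NONE, MSG_NOTICE, MSG_WARNING } MsgLevel;

/*
 * Decide which message, if any, to emit after `waits` seconds of waiting.
 * A WARNING is due every `warn_every` seconds; a one-time NOTICE is due
 * at `notice_after` seconds (0 means as soon as the wait begins).
 */
MsgLevel message_for_wait(int waits, int notice_after, int warn_every)
{
    assert(waits >= 0 && notice_after >= 0 && warn_every > 0);
    if (waits > 0 && waits % warn_every == 0)
        return MSG_WARNING;
    if (waits == notice_after)
        return MSG_NOTICE;
    return MSG_NONE;
}

/* Format the message into buf; returns what snprintf returns, or 0 for MSG_NONE. */
int format_wait_message(char *buf, size_t size, MsgLevel level, int waits)
{
    switch (level) {
    case MSG_NOTICE:
        return snprintf(buf, size,
                        "NOTICE: backup work done, waiting for WAL archiving to complete");
    case MSG_WARNING:
        return snprintf(buf, size,
                        "WARNING: pg_stop_backup still waiting for archive to complete "
                        "(%d seconds elapsed)\n"
                        "HINT: check archive_command; the wait can be cancelled, "
                        "but the backup will not be usable",
                        waits);
    default:
        if (size > 0)
            buf[0] = '\0';
        return 0;
    }
}
```

With `notice_after` set to 0 and `warn_every` set to 60, the user sees the NOTICE immediately and the WARNING with its HINT at 60, 120, ... seconds, which matches the shape of the proposal.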

        regards, tom lane


Re: pg_stop_backup does not complete

From
Simon Riggs
Date:
On Wed, 2010-02-24 at 13:30 -0800, Josh Berkus wrote:

> So I'm seeing pg_abort_backup(), which also produces
> markers which prevent the backup from loading, as an improvement on
> current UI.

Since Kevin suggested this in his first post and I agreed with that in
the first paragraph of my first post, I think you've wasted a lot of
time here going in circles. 42 posts, more than a dozen people. I think
we have better things to do than this small issue, which has nothing at
all to do with a 9.0 feature. I think you should look at prioritisation.
There are many things seriously in need of fixing and this wasn't one of
them.

Please test the following patch to see if it meets your needs and check
the wordings used in the docs.

--
 Simon Riggs           www.2ndQuadrant.com

Attachment

Re: pg_stop_backup does not complete

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> Please test the following patch to see if it meets your needs and check
> the wordings used in the docs.

What exactly will that function accomplish, given the assumption that
the user already tried pg_stop_backup?

        regards, tom lane


Re: pg_stop_backup does not complete

From
Josh Berkus
Date:
Tom, Simon,

> * emit a NOTICE as soon as pg_stop_backup's actual work is done and
> it's starting to wait for the archiver (or maybe after it's waited
> for a few seconds, but much less than the present 60).
> 
> * extend the existing WARNING (and the NOTICE too if we elect to have
> one) with a HINT message explicitly saying that you can cancel the
> wait but thus-and-such consequences might ensue.
> 
> Both of these things would only be helpful when using client software
> that shows you received notices promptly.  psql is okay, but maybe
> pgAdmin and other tools would need some further work.  There is not
> much we can do about that in the core project though.

Well, the client software could be fixed in time for 9.0, I'd think.  I
think that implementing both of the above would probably do the trick
for user-friendliness, enough for 9.0.  If it's obvious to the user on
the console what to do, then they won't panic.

> Since Kevin suggested this in his first post and I agreed with that in
> the first paragraph of my first post, I think you've wasted a lot of
> time here going in circles. 42 posts, more than a dozen people. I think

Please tone down the hostility, Simon.  I don't think talking about an
issue I encountered while testing is a waste of anyone's time, it's how
we improve the software.  In fact, I'm hoping that potential testers are
noticing the drubbing you're getting over this, because belittling
anyone's bug reports is not exactly a good way to attract new testers to
the project.

Further, the multiple posts seem to have arrived at a minimally
intrusive solution, so it seems like time well spent.

> we have better things to do than this small issue, which has nothing at
> all to do with a 9.0 feature. I think you should look at prioritisation.
> There are many things seriously in need of fixing and this wasn't one of
> them.

You've made it clear, repeatedly, that you don't happen to think that
new user experience is a priority for 9.0.  However, a lot of us on this
list disagree with you, and will continue to do so.  Priorities are a
community decision, not an individual developer one.

--Josh Berkus


Re: pg_stop_backup does not complete

From
Greg Smith
Date:
Josh Berkus wrote:
> Thing is, if archive_command is failing, then the backup is useless
> regardless until it's fixed.  And sending the archives to /dev/null (the
> fix you're essentially recommending above) doesn't make the backup any
> more useful.

That's not what I said to do first.  If it's possible to fix your 
archive_command, and it never returned bad "I'm saying success but I 
didn't really do the right thing" information to the server--it just 
failed--this situation is completely recoverable with no damage to the 
backup.  Just fix the archive_command, reload the configuration, and the 
queue of archived files will flow and eventually your consistent backup 
completes.  This is the only behavior someone who is trying to recover
from a mistake in production is likely to find acceptable, and as Simon
has pointed out that is what the current situation is optimized for.

Only in the situation where the archive_command was so bad that it 
returned the wrong data to the server--saying the segment was saved but 
it really wasn't--did I suggest that you might as well change 
archive_command to go nowhere.  Because in that case, your backup is 
already screwed, you lost an essential piece of it.

As for your comment about treating this like it's a problem specific to
you, did you miss the part where I pointed out I was just expressing
concerns about poor visibility into this area ("what is the archiver
doing?") recently?  I'm well aware this path is full of difficult to
escape from holes.  We just need to be careful not to do something that
screws over production users in the name of reducing the learning curve.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us



Re: pg_stop_backup does not complete

From
Josh Berkus
Date:
> That's not what I said to do first.  If it's possible to fix your
> archive_command, and it never returned bad "I'm saying success but I
> didn't really do the right thing" information to the server--it just
> failed--this situation is completely recoverable with no damage to the
> backup.  Just fix the archive_command, reload the configuration, and the
> queue of archived files will flow and eventually your consistent backup
> completes.  This is the only behavior someone who is trying to recover
> from a mistake in production is likely to find acceptable, and as Simon
> has pointed out that is what the current situation is optimized for.

Right.  I'm pointing out that production and "trying out 9.0 for the
first time" are actually different circumstances, and we need to be able
to handle both gracefully.  Since, if people have a bad experience
trying it out for the first time, we'll never *get* to production.

> As for your comment about treating this like it's a problem specific to
> you, did you miss the part where I pointed out I was just expressing
> concerns about poor visibility into this area ("what is the archiver
> doing?") recently?  I'm well aware this path is full of difficult to
> escape from holes.  We just need to be careful not to do something that
> screws over production users in the name of reducing the learning curve.

I think Tom's idea is minimally intrusive, and deals with the central
problem, which is one of UI and visibility as you assessed.

--Josh Berkus



Re: pg_stop_backup does not complete

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> Tom, Simon,
>> * emit a NOTICE as soon as pg_stop_backup's actual work is done and
>> it's starting to wait for the archiver (or maybe after it's waited
>> for a few seconds, but much less than the present 60).
>> 
>> * extend the existing WARNING (and the NOTICE too if we elect to have
>> one) with a HINT message explicitly saying that you can cancel the
>> wait but thus-and-such consequences might ensue.
>> 
>> Both of these things would only be helpful when using client software
>> that shows you received notices promptly.  psql is okay, but maybe
>> pgAdmin and other tools would need some further work.  There is not
>> much we can do about that in the core project though.

> Well, the client software could be fixed in time for 9.0, I'd think.  I
> think that implementing both of the above would probably do the trick
> for user-friendliness, enough for 9.0.  If it's obvious to the user on
> the console what to do, then they won't panic.

If you like the concept, then the next question is exactly how to phrase
the messages.  All we have at the moment is the inside-the-delay-loop
warning:
    ereport(WARNING,
            (errmsg("pg_stop_backup still waiting for archive to complete (%d seconds elapsed)",
                    waits)));
 

which now that I look at it could use some wordsmithing itself.
Suggestions?

        regards, tom lane


Re: pg_stop_backup does not complete

From
"David E. Wheeler"
Date:
On Feb 24, 2010, at 3:24 PM, Tom Lane wrote:

> If you like the concept, then the next question is exactly how to phrase
> the messages.  All we have at the moment is the inside-the-delay-loop
> warning:
>
>    ereport(WARNING,
>            (errmsg("pg_stop_backup still waiting for archive to complete (%d seconds elapsed)",
>                    waits)));
>
> which now that I look at it could use some wordsmithing itself.

“Bitch, can’t you see that I’m still waiting for the archive to complete? WTF were you thinking? Jesus!”

My $0.02.

Best,

David

Re: pg_stop_backup does not complete

From
Josh Berkus
Date:
> If you like the concept, then the next question is exactly how to phrase
> the messages.  All we have at the moment is the inside-the-delay-loop
> warning:
> 
>     ereport(WARNING,
>             (errmsg("pg_stop_backup still waiting for archive to complete (%d seconds elapsed)",
>                     waits)));

Well, we'll want this message first, as soon as pg_stop_backup finishes
checkpointing:

WARNING: Stop backup work complete.  Now awaiting completion of WAL
archiving.

Then after 60s:

WARNING: pg_stop_backup is still waiting for WAL archiving to complete
(%d seconds elapsed).
HINT: Check if your WAL archive_command is failing.  You may abort
pg_stop_backup at this point, but you will not be able to use the
resulting clone.


Re: pg_stop_backup does not complete

From
Simon Riggs
Date:
On Wed, 2010-02-24 at 16:52 -0500, Tom Lane wrote:

> Before you could enter pg_abort_backup you'd have to control-C out of
> the pg_stop_backup call, and that action already accomplishes the only
> thing pg_abort_backup could do for you.

Agreed. I was responding to perceived user need.

> So what I am thinking is that this is really just a minor bit of user
> unfriendliness in pg_stop_backup.  We should address it with one or
> both of these changes:
> 
> * emit a NOTICE as soon as pg_stop_backup's actual work is done and
> it's starting to wait for the archiver (or maybe after it's waited
> for a few seconds, but much less than the present 60).

Pointless really. Nobody runs backups in production by typing
pg_stop_backup() except in a demo. Nobody will see this. 

> * extend the existing WARNING (and the NOTICE too if we elect to have
> one) with a HINT message explicitly saying that you can cancel the
> wait but thus-and-such consequences might ensue.

If you can see the HINT, you can also see the WARNING. If you can see
the WARNING and do nothing, I don't think we need an "objects in the
mirror may be closer than they appear" message. If people can't work out
that if a) they are running something and b) that something is waiting
that they should cancel it then we aren't going to have much luck with
them.

-- Simon Riggs           www.2ndQuadrant.com



Re: pg_stop_backup does not complete

From
Simon Riggs
Date:
On Wed, 2010-02-24 at 14:20 -0800, Josh Berkus wrote:
> > Since Kevin suggested this in his first post and I agreed with that in
> > the first paragraph of my first post, I think you've wasted a lot of
> > time here going in circles. 42 posts, more than a dozen people. I think
> 
> Please tone down the hostility, Simon.  I don't think talking about an
> issue I encountered while testing is a waste of anyone's time, it's
> how we improve the software.  In fact, I'm hoping that potential
> testers are noticing the drubbing you're getting over this, because
> belittling anyone's bug reports is not exactly a good way to attract
> new testers to the project.

Saying "it's not a bug" doesn't belittle your bug report. Your first
report was not time wasting, but talking endlessly about a subject that
you've had clear replies on becomes time wasting. As I've said many
times now, this isn't even a 9.0 issue. Expressing that opinion is not
hostility.

I'm not sure why you think *I* am receiving a drubbing? You made a
mistake on a demo, filed a bug report and wouldn't listen to people
telling you its not a bug. I admire your attempts at oneupmanship.

-- Simon Riggs           www.2ndQuadrant.com



Re: pg_stop_backup does not complete

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Wed, 2010-02-24 at 16:52 -0500, Tom Lane wrote:
>> * emit a NOTICE as soon as pg_stop_backup's actual work is done and
>> it's starting to wait for the archiver (or maybe after it's waited
>> for a few seconds, but much less than the present 60).

> Pointless really. Nobody runs backups in production by typing
> pg_stop_backup() except in a demo. Nobody will see this. 

I agree it's pointless in production, but this isn't about production,
it's about friendliness to people who are experimenting.  The case will
probably never come up in production because a production installation
should have a non-broken archive_command.

>> * extend the existing WARNING (and the NOTICE too if we elect to have
>> one) with a HINT message explicitly saying that you can cancel the
>> wait but thus-and-such consequences might ensue.

> If you can see the HINT, you can also see the WARNING. If you can see
> the WARNING and do nothing, I don't think we need a "objects in the
> mirror may be closer than they appear" message. If people can't work out
> that if a) they are running something and b) that something is waiting
> that they should cancel it then we aren't going to have much luck with
> them.

The value of the HINT I think would be to make them (a) not afraid to
hit control-C and (b) aware of the fact that their archiver has got
a problem.
        regards, tom lane


Re: pg_stop_backup does not complete

From
"Joshua D. Drake"
Date:
On Wed, 2010-02-24 at 23:57 +0000, Simon Riggs wrote:
>
> > * emit a NOTICE as soon as pg_stop_backup's actual work is done and
> > it's starting to wait for the archiver (or maybe after it's waited
> > for a few seconds, but much less than the present 60).
>
> Pointless really. Nobody runs backups in production by typing
> pg_stop_backup() except in a demo. Nobody will see this.

This is not true. It is not uncommon for a PITR setup to get out of sync
for any number of production reasons. It is one of the reasons that
PITRTools supports executing a pg_stop_backup.

Joshua D. Drake



--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
Respect is earned, not gained through arbitrary and repetitive use or Mr. or Sir.

Re: pg_stop_backup does not complete

From
Simon Riggs
Date:
On Wed, 2010-02-24 at 19:08 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Wed, 2010-02-24 at 16:52 -0500, Tom Lane wrote:
> >> * emit a NOTICE as soon as pg_stop_backup's actual work is done and
> >> it's starting to wait for the archiver (or maybe after it's waited
> >> for a few seconds, but much less than the present 60).
> 
> > Pointless really. Nobody runs backups in production by typing
> > pg_stop_backup() except in a demo. Nobody will see this. 
> 
> I agree it's pointless in production, but this isn't about production,
> it's about friendliness to people who are experimenting.  The case will
> probably never come up in production because a production installation
> should have a non-broken archive_command.

No further objection.

-- Simon Riggs           www.2ndQuadrant.com



Re: pg_stop_backup does not complete

From
"Joshua D. Drake"
Date:
On Wed, 2010-02-24 at 10:07 -0800, Josh Berkus wrote:
> Simon,

> > Your mistake was not typoing an archive_command, it was not correctly
> > testing that what you had done was actually working. The fix is to read
> > the manual and correct the typo. Shutting down the server after failing
> > to configure it is not likely to be a normal reaction to experiencing an
> > error in configuration.
> 
> The problem is you're thinking of an experienced PostgreSQL DBA doing
> setup on a production server.  That's not what I'm talking about.  I'm
> talking about the thousands of new users who are going to try PostgreSQL
> for the first time because of HS/SR on a test installation.  If they
> encounter this issue, they will decide (again) that PostgreSQL is too
> hard to use and give up on us for another 5 years.

Shoot, forget the "new users"; I am thinking about the hundreds of
thousands of existing non-DBA users, i.e. 90% of our user base.


> 
> Saying "RTFM and test, you newbie!" is not a valid response, and that's
> what your "you should have read the docs" amounts to.  Heck, I *did*
> read the docs.

Agreed. Although RTFM is important, we shouldn't have RTFM for something
that is clearly a user-visible behavior mistake on our part.

> 
> > ISTM you should collect test reports, then analyse and prioritise them.
> > This rates pretty low for me: low severity, low frequency.
> 
> To date, I, Robert Haas, Joe Conway, Josh Drake, and the members of
> LAPUG all find this highly problematic behavior.  So consider it 6
> problem reports, not just one.
> 

Basically, the reports boil down to people who are actually going to be
dealing with this in the field. Simon, with respect, you have been 6 feet
deep in code for too long on this. You need to step back and take some
constructive feedback from those who are dealing with the production
issues, and do so with a smile.

Sincerely,

Joshua D. Drake





-- 
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
Respect is earned, not gained through arbitrary and repetitive use or Mr. or Sir.



Re: pg_stop_backup does not complete

From
Greg Smith
Date:
Tom Lane wrote:
> The value of the HINT I think would be to make them (a) not afraid to
> hit control-C and (b) aware of the fact that their archiver has got
> a problem.
>
Agreed on both points.  Patch attached that implements something similar
to Josh's wording, tweaking the original warning too.  Here's what it
looks like when you run into the bad situation (which I easily simulated
with "archive_command='/bin/false'") from the client's perspective:

gsmith@meddle:~/pgwork/src/master/src$ psql -c "select pg_start_backup('test')"
 pg_start_backup
-----------------
 0/5000020
(1 row)

gsmith@meddle:~/pgwork/src/master/src$ psql
psql (9.0devel)
Type "help" for help.

gsmith=# select pg_stop_backup();
NOTICE:  pg_stop_backup cleanup done, waiting for required segments to archive
WARNING:  pg_stop_backup still waiting for all required segments to archive (60 seconds elapsed)
HINT:  Confirm your archive_command is executing successfully.  pg_stop_backup can be aborted safely, but the resulting backup will not be usable.
^CCancel request sent
ERROR:  canceling statement due to user request

And this is the sort of thing that shows up in the logs with default
logging behavior while all this is happening; you don't see the NOTICE,
but the WARNING and HINT are both there which I think is good:

LOG:  archive command failed with exit code 1
DETAIL:  The failed archive command was: /bin/false
WARNING:  transaction log file "000000010000000000000000" could not be archived: too many failures
WARNING:  pg_stop_backup still waiting for all required segments to archive (60 seconds elapsed)
HINT:  Confirm your archive_command is executing successfully.  pg_stop_backup can be aborted safely, but the resulting backup will not be usable.

Does this solve the logging side of this?  You can still make a case for
a more forceful pg_stop_backup, but this seems to at least remove much of
the mystery and frustration from the whole exercise.  This patch plus a
little documentation suggesting how to recover from this issue might be
enough.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ca088b0..c09ede9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8125,6 +8125,9 @@ pg_stop_backup(PG_FUNCTION_ARGS)
     BackupHistoryFileName(histfilename, ThisTimeLineID, _logId, _logSeg,
                           startpoint.xrecoff % XLogSegSize);

+    ereport(NOTICE,
+            (errmsg("pg_stop_backup cleanup done, waiting for required segments to archive")));
+
     seconds_before_warning = 60;
     waits = 0;

@@ -8139,8 +8142,10 @@ pg_stop_backup(PG_FUNCTION_ARGS)
         {
             seconds_before_warning *= 2;        /* This wraps in >10 years... */
             ereport(WARNING,
-                    (errmsg("pg_stop_backup still waiting for archive to complete (%d seconds elapsed)",
-                            waits)));
+                    (errmsg("pg_stop_backup still waiting for all required segments to archive (%d seconds elapsed)",
+                            waits),
+                      errhint("Confirm your archive_command is executing successfully.  "
+                             "pg_stop_backup can be aborted safely, but the resulting backup will not be usable.")));
         }
     }
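
To make the warning schedule in the patch above concrete, here is a
minimal standalone C sketch.  It is not the xlog.c code: the function
name and the archive_busy_seconds parameter are hypothetical stand-ins
for polling the archiver, and it assumes one loop iteration per second,
as the "%d seconds elapsed" message suggests.  It only reproduces the
doubling of seconds_before_warning.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simulates when the wait loop would emit its WARNING.
 * archive_busy_seconds is an assumed stand-in for how long the archiver
 * stays busy; the real code polls the archiver instead.
 * Stores the elapsed-seconds value of each warning in out[] (up to
 * out_len entries) and returns the total number of warnings.
 */
size_t
simulate_stop_backup_warnings(int archive_busy_seconds, int *out, size_t out_len)
{
    int     seconds_before_warning = 60;    /* same start value as the patch */
    int     waits = 0;
    size_t  nwarnings = 0;

    assert(archive_busy_seconds >= 0);

    while (waits < archive_busy_seconds)
    {
        /* the real loop is assumed to sleep one second per iteration here */
        if (++waits >= seconds_before_warning)
        {
            seconds_before_warning *= 2;    /* back off: 60, 120, 240, ... */
            if (nwarnings < out_len)
                out[nwarnings] = waits;
            nwarnings++;
        }
    }
    return nwarnings;
}
```

With the archiver busy for 600 seconds, this reports warnings at 60,
120, 240 and 480 seconds, so a stuck archive_command is mentioned less
and less often as the wait goes on.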


Re: pg_stop_backup does not complete

From
Josh Berkus
Date:
On 2/24/10 5:36 PM, Greg Smith wrote:
> gsmith=# select pg_stop_backup();
> NOTICE:  pg_stop_backup cleanup done, waiting for required segments to
> archive
> WARNING:  pg_stop_backup still waiting for all required segments to
> archive (60 seconds elapsed)
> HINT:  Confirm your archive_command is executing successfully. 
> pg_stop_backup can be aborted safely, but the resulting backup will not
> be usable.
> ^CCancel request sent
> ERROR:  canceling statement due to user request

This looks really good, thanks!

> Does this solve the logging side of this?  You can still make a case for
> a more forceful pg_stop_backup, this seems to at least remove much of
> the mystery and frustration from the whole exercise.  This patch plus a
> little documentation suggesting how to recover from this issue might be
> enough.

Yeah, the concern is user-friendliness.  As Simon points out, allowing
pg_stop_backup to abort would have other unexpected-results issues.

--Josh Berkus



Re: pg_stop_backup does not complete

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> On 2/24/10 5:36 PM, Greg Smith wrote:
>> gsmith=# select pg_stop_backup();
>> NOTICE:  pg_stop_backup cleanup done, waiting for required segments to
>> archive
>> WARNING:  pg_stop_backup still waiting for all required segments to
>> archive (60 seconds elapsed)
>> HINT:  Confirm your archive_command is executing successfully. 
>> pg_stop_backup can be aborted safely, but the resulting backup will not
>> be usable.
>> ^CCancel request sent
>> ERROR:  canceling statement due to user request

> This looks really good, thanks!

The one thing I'm undecided about is whether we want the immediate
NOTICE, as opposed to dialing down the time till the first WARNING
to something like 5 or 10 seconds.  I think the main argument for the
latter approach would be to avoid log-spam in normal operation.
Although Greg is correct that a NOTICE wouldn't be logged at default
log levels, lots of people don't use that default.  Comments?
        regards, tom lane


Re: pg_stop_backup does not complete

From
David Fetter
Date:
On Wed, Feb 24, 2010 at 08:52:28PM -0500, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > On 2/24/10 5:36 PM, Greg Smith wrote:
> >> gsmith=# select pg_stop_backup();
> >> NOTICE:  pg_stop_backup cleanup done, waiting for required segments to
> >> archive
> >> WARNING:  pg_stop_backup still waiting for all required segments to
> >> archive (60 seconds elapsed)
> >> HINT:  Confirm your archive_command is executing successfully. 
> >> pg_stop_backup can be aborted safely, but the resulting backup will not
> >> be usable.
> >> ^CCancel request sent
> >> ERROR:  canceling statement due to user request
> 
> > This looks really good, thanks!
> 
> The one thing I'm undecided about is whether we want the immediate
> NOTICE, as opposed to dialing down the time till the first WARNING
> to something like 5 or 10 seconds.  I think the main argument for
> the latter approach would be to avoid log-spam in normal operation.
> Although Greg is correct that a NOTICE wouldn't be logged at default
> log levels, lots of people don't use that default.  Comments?

As I see it, the clarity concern trumps the log spam one.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: pg_stop_backup does not complete

From
Greg Smith
Date:
Tom Lane wrote:
> The one thing I'm undecided about is whether we want the immediate
> NOTICE, as opposed to dialing down the time till the first WARNING
> to something like 5 or 10 seconds.  I think the main argument for the
> latter approach would be to avoid log-spam in normal operation

I thought about that for a minute, but didn't think pg_stop_backup is a 
common enough operation that anyone will complain that it's a little 
more verbose in its logging now.  I know when I was new to this, I used 
to wonder just what it was busy doing just after executing this command 
when it hung there for a while sometimes, and would have welcomed this 
extra bit of detail--preferably immediately, not even after a 5 or 10 
second delay.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us



Re: pg_stop_backup does not complete

From
Josh Berkus
Date:
On 2/24/10 5:58 PM, Greg Smith wrote:
> 
> I thought about that for a minute, but didn't think pg_stop_backup is a
> common enough operation that anyone will complain that it's a little
> more verbose in its logging now.  I know when I was new to this, I used
> to wonder just what it was busy doing just after executing this command
> when it hung there for a while sometimes, and would have welcomed this
> extra bit of detail--preferably immediately, not even after a 5 or 10
> second delay.

+1

--Josh


Re: pg_stop_backup does not complete

From
Fujii Masao
Date:
On Thu, Feb 25, 2010 at 10:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The one thing I'm undecided about is whether we want the immediate
> NOTICE, as opposed to dialing down the time till the first WARNING
> to something like 5 or 10 seconds.  I think the main argument for the
> latter approach would be to avoid log-spam in normal operation.
> Although Greg is correct that a NOTICE wouldn't be logged at default
> log levels, lots of people don't use that default.  Comments?

I don't want that immediate NOTICE message, which sounds like noise.
Delaying it or changing the log level to DEBUG works for me.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: pg_stop_backup does not complete

From
Josh Berkus
Date:
> I don't want that immediate NOTICE message, which sounds like noise.
> Delaying it or changing the log level to DEBUG works for me.

Problem is that a new user won't be seeing DEBUG messages by default.
This issue is all about new user experience.

Alternatively, we could move the time of the first "waiting for archive"
message up, but that seems misleading.

--Josh Berkus



Re: pg_stop_backup does not complete

From
Tom Lane
Date:
Greg Smith <greg@2ndquadrant.com> writes:
> Tom Lane wrote:
>> The value of the HINT I think would be to make them (a) not afraid to
>> hit control-C and (b) aware of the fact that their archiver has got
>> a problem.
>> 
> Agreed on both points.  Patch attached that implements something similar 
> to Josh's wording, tweaking the original warning too.

OK, everyone likes the immediate NOTICE.  I did a bit of copy-editing
and committed the attached version.
        regards, tom lane

Index: xlog.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.377
diff -c -r1.377 xlog.c
*** xlog.c    19 Feb 2010 10:51:03 -0000    1.377
--- xlog.c    25 Feb 2010 02:15:49 -0000
***************
*** 8132,8138 ****
       *
       * We wait forever, since archive_command is supposed to work and we
       * assume the admin wanted his backup to work completely. If you don't
!      * wish to wait, you can set statement_timeout.
       */
      XLByteToPrevSeg(stoppoint, _logId, _logSeg);
      XLogFileName(lastxlogfilename, ThisTimeLineID, _logId, _logSeg);
--- 8132,8139 ----
       *
       * We wait forever, since archive_command is supposed to work and we
       * assume the admin wanted his backup to work completely. If you don't
!      * wish to wait, you can set statement_timeout.  Also, some notices
!      * are issued to clue in anyone who might be doing this interactively.
       */
      XLByteToPrevSeg(stoppoint, _logId, _logSeg);
      XLogFileName(lastxlogfilename, ThisTimeLineID, _logId, _logSeg);
***************
*** 8141,8146 ****
--- 8142,8150 ----
      BackupHistoryFileName(histfilename, ThisTimeLineID, _logId, _logSeg,
                            startpoint.xrecoff % XLogSegSize);
  
+     ereport(NOTICE,
+             (errmsg("pg_stop_backup cleanup done, waiting for required WAL segments to be archived")));
+ 
      seconds_before_warning = 60;
      waits = 0;
  
***************
*** 8155,8162 ****
          {
              seconds_before_warning *= 2;        /* This wraps in >10 years... */
              ereport(WARNING,
!                     (errmsg("pg_stop_backup still waiting for archive to complete (%d seconds elapsed)",
!                             waits)));
          }
      }
  
--- 8159,8169 ----
          {
              seconds_before_warning *= 2;        /* This wraps in >10 years... */
              ereport(WARNING,
!                     (errmsg("pg_stop_backup still waiting for all required WAL segments to be archived (%d seconds elapsed)",
!                             waits),
!                      errhint("Check that your archive_command is executing properly. "
!                              "pg_stop_backup can be cancelled safely, "
!                              "but the database backup will not be usable without all the WAL segments.")));
          }
      }


Re: pg_stop_backup does not complete

From
"Joshua D. Drake"
Date:
On Wed, 2010-02-24 at 19:02 +0000, Simon Riggs wrote:
> On Wed, 2010-02-24 at 10:17 -0800, Joshua D. Drake wrote:

> You make the mistake of assuming that someone that can develop has no
> solution experience. That is exactly how I fund further development, so
> you are off base by a long way.

I never implied that. I implied that your perspective is currently
skewed. I stand by that implication. 

Joshua D. Drake



-- 
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
Respect is earned, not gained through arbitrary and repetitive use or Mr. or Sir.



Re: pg_stop_backup does not complete

From
"Joshua D. Drake"
Date:
On Wed, 2010-02-24 at 12:32 -0800, Josh Berkus wrote:
> > pg_stop_backup() doesn't complete until all the WAL segments needed to
> > restore from the backup are archived. If archive_command is failing,
> > that never happens.
> 
> OK, so we need a way out of that cycle if the user is issuing
> pg_stop_backup because they *already know* that archive_command is
> failing.  Right now, there's no way out other than a fast shutdown,
> which is a bit user-hostile.

Hmmm well... changing the archive_command to /bin/true and issuing a HUP
would cause the command to succeed, but I still think that is over the
top. I prefer Kevin's solution or some variant thereof:

http://archives.postgresql.org/pgsql-hackers/2010-02/msg01853.php
http://archives.postgresql.org/pgsql-hackers/2010-02/msg01907.php


Sincerely,

Joshua D. Drake



-- 
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
Respect is earned, not gained through arbitrary and repetitive use or Mr. or Sir.



Re: pg_stop_backup does not complete

From
Greg Stark
Date:
On Wed, Feb 24, 2010 at 11:14 PM, Josh Berkus <josh@agliodbs.com> wrote:
>
> Right.  I'm pointing out that production and "trying out 9.0 for the
> first time" are actually different circumstances, and we need to be able
> to handle both gracefully.  Since, if people have a bad experience
> trying it out for the first time, we'll never *get* to production.

Fwiw, if it's not clear what's going on when you're trying out
something carefully for the first time, it's 10x worse if you're stuck
in a situation like this when you have people breathing down your neck,
yelling about how they're losing money for every second you're down.

In an ideal world it would be best if pg_stop_backup could actually
print the error status of the archiving command. Is there any way for
it to get ahold of the fact that the archiving is failing?

And do we have closure on whether a "fast" shutdown is hanging? Or was
that actually a smart shutdown?

Perhaps "smart" shutdown needs to print out what it's waiting on
periodically as well, and suggest a fast shutdown to abort those
transactions.

--
greg


Re: pg_stop_backup does not complete

From
Bruce Momjian
Date:
Heikki Linnakangas wrote:
> Josh Berkus wrote:
> > OK, can you go through the reasons why pg_stop_backup would not
> > complete?  
> 
> pg_stop_backup() doesn't complete until all the WAL segments needed to
> restore from the backup are archived. If archive_command is failing,
> that never happens.

Yes, very old behavior allowed people to think they had a full backup
when the WAL files needed were not all archived, which was a bad thing. 
Thankfully no one reported catastrophic failure from the old behavior.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +
 


Re: pg_stop_backup does not complete

From
Josh Berkus
Date:
> In an ideal world it would be best if pg_stop_backup could actually
> print the error status of the archiving command. 

Agreed.

> And do we have closure on whether a "fast" shutdown is hanging? Or was
> that actually a smart shutdown?

No, I need to retest and verify 100% that the issue wasn't something
other than stop_backup.

> Perhaps "smart" shutdown needs to print out what it's waiting on
> periodically as well, and suggest a fast shutdown to abort those
> transactions.

That would be a good thing to have for PostgreSQL in general.  Given
that any number of things can stop a smart shutdown, it's more than a
little baffling to users why one hangs forever.

BUT ... since most users run smart shutdown via a services script,
output on what shutdown is waiting on would need to be written to the
log rather than given interactively.

--Josh Berkus


Re: pg_stop_backup does not complete

From
Greg Smith
Date:
Greg Stark wrote:
> In an ideal world it would be best if pg_stop_backup could actually
> print the error status of the archiving command. Is there any way for
> it to get ahold of the fact that the archiving is failing?
>   

This is in the area where I mentioned I'd recently proposed a patch.  The
archiver doesn't tell anyone anything about what it's doing right now,
or even save its state information.  I made a proposal for making the
bit it's currently working on (or just finished, or both) visible:
http://archives.postgresql.org/message-id/4B4FEA18.5080705@2ndquadrant.com

The main content for that was tracking disk space, which wandered into a
separate discussion, but it would be easy enough to use the information
that patch intends to export ("what archive file is currently being
processed?") and print that in the error message too.  That makes it easy
enough for people to infer the command is failing if the same segment
number shows up every time in that message.

I didn't finish that only because the CF kicked off and I switched out 
of new development to review.  Since this class of error keeps popping 
up, I could easily finish that patch off by next week and see if it 
helps here.  I thought it was a long overdue bit of monitoring to add to 
the database anyway, just never had the time to work on it before.

> And do we have closure on whether a "fast" shutdown is hanging? Or was
> that actually a smart shutdown?
>   

When I tested this myself, a smart shutdown hung every time, while a 
fast one blew right through the problem--matching what's described in 
the manual.  Josh suggested at one point he might have seen a situation 
where fast shutdown wasn't sufficient to work around this and an 
immediate one was required.  Certainly possible that happened for an as 
yet unknown reason--I've seen plenty of situations where fast shutdown 
didn't work--but I haven't been able to replicate it.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us



Re: pg_stop_backup does not complete

From
Bruce Momjian
Date:
Joshua D. Drake wrote:
> On Wed, 2010-02-24 at 12:32 -0800, Josh Berkus wrote:
> > > pg_stop_backup() doesn't complete until all the WAL segments needed to
> > > restore from the backup are archived. If archive_command is failing,
> > > that never happens.
> > 
> > OK, so we need a way out of that cycle if the user is issuing
> > pg_stop_backup because they *already know* that archive_command is
> > failing.  Right now, there's no way out other than a fast shutdown,
> > which is a bit user-hostile.
> 
> Hmmm well... changing the archive_command to /bin/true and issuing a HUP
> would cause the command to succeed, but I still think that is over the
> top. I prefer Kevin's solution or some variant thereof:
> 
> http://archives.postgresql.org/pgsql-hackers/2010-02/msg01853.php
> http://archives.postgresql.org/pgsql-hackers/2010-02/msg01907.php

Postgres 9.0 will be the first release to mention /bin/true as a way of
turning off archiving in extraordinary circumstances:
http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html
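
/bin/true works here because a zero exit status from archive_command is
what counts as a successful archive.  The following is a minimal C
sketch of that kind of check, using system() from the C library.  The
helper name is hypothetical, and this is only an illustration of "zero
exit status means success", not a copy of the archiver's own code.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: runs a shell command and returns 1 if it exited
 * normally with status zero, 0 otherwise (nonzero exit, killed by a
 * signal, or the shell could not be started).
 */
int
command_succeeded(const char *command)
{
    int     status;

    assert(command != NULL);

    status = system(command);
    return status != -1 && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Under this rule a command such as "true" counts as success and "false"
counts as failure, regardless of whether anything was actually copied.
That is why pointing archive_command at /bin/true lets the wait finish
while leaving no usable archive behind.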

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +
 


Re: pg_stop_backup does not complete

From
Bruce Momjian
Date:
Looks like we arrived at the best solution here.  I don't think it was
clear to users that pg_stop_backup() was issuing an archive_command and
hence they wouldn't be likely to understand the delay or correct a
problem.  This gives them the information they need at the time they
need it.

---------------------------------------------------------------------------

Tom Lane wrote:
> Greg Smith <greg@2ndquadrant.com> writes:
> > Tom Lane wrote:
> >> The value of the HINT I think would be to make them (a) not afraid to
> >> hit control-C and (b) aware of the fact that their archiver has got
> >> a problem.
> >> 
> > Agreed on both points.  Patch attached that implements something similar 
> > to Josh's wording, tweaking the original warning too.
> 
> OK, everyone likes the immediate NOTICE.  I did a bit of copy-editing
> and committed the attached version.
> 
>             regards, tom lane
> 
> Index: xlog.c
> ===================================================================
> RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
> retrieving revision 1.377
> diff -c -r1.377 xlog.c
> *** xlog.c    19 Feb 2010 10:51:03 -0000    1.377
> --- xlog.c    25 Feb 2010 02:15:49 -0000
> ***************
> *** 8132,8138 ****
>        *
>        * We wait forever, since archive_command is supposed to work and we
>        * assume the admin wanted his backup to work completely. If you don't
> !      * wish to wait, you can set statement_timeout.
>        */
>       XLByteToPrevSeg(stoppoint, _logId, _logSeg);
>       XLogFileName(lastxlogfilename, ThisTimeLineID, _logId, _logSeg);
> --- 8132,8139 ----
>        *
>        * We wait forever, since archive_command is supposed to work and we
>        * assume the admin wanted his backup to work completely. If you don't
> !      * wish to wait, you can set statement_timeout.  Also, some notices
> !      * are issued to clue in anyone who might be doing this interactively.
>        */
>       XLByteToPrevSeg(stoppoint, _logId, _logSeg);
>       XLogFileName(lastxlogfilename, ThisTimeLineID, _logId, _logSeg);
> ***************
> *** 8141,8146 ****
> --- 8142,8150 ----
>       BackupHistoryFileName(histfilename, ThisTimeLineID, _logId, _logSeg,
>                             startpoint.xrecoff % XLogSegSize);
>   
> +     ereport(NOTICE,
> +             (errmsg("pg_stop_backup cleanup done, waiting for required WAL segments to be archived")));
> + 
>       seconds_before_warning = 60;
>       waits = 0;
>   
> ***************
> *** 8155,8162 ****
>           {
>               seconds_before_warning *= 2;        /* This wraps in >10 years... */
>               ereport(WARNING,
> !                     (errmsg("pg_stop_backup still waiting for archive to complete (%d seconds elapsed)",
> !                             waits)));
>           }
>       }
>   
> --- 8159,8169 ----
>           {
>               seconds_before_warning *= 2;        /* This wraps in >10 years... */
>               ereport(WARNING,
> !                     (errmsg("pg_stop_backup still waiting for all required WAL segments to be archived (%d seconds elapsed)",
> !                             waits),
> !                      errhint("Check that your archive_command is executing properly. "
> !                              "pg_stop_backup can be cancelled safely, "
> !                              "but the database backup will not be usable without all the WAL segments.")));
>           }
>       }
>   
> 

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +
 


Re: pg_stop_backup does not complete

From
Bernd Helmle
Date:

--On 24. Februar 2010 16:01:02 -0500 Tom Lane <tgl@sss.pgh.pa.us> wrote:

> One objection to this is that it's not very clear to the user when
> pg_stop_backup has finished with actual work and is just waiting for the
> archiver, ie when is it safe to hit control-C?  Maybe we should emit a
> "backup done, waiting for archiver to complete" notice before entering
> the sleep loop.

+1 for this. This hint would certainly help to recognize the issue 
immediately (or at least point to a possible cause).

-- 
Thanks
Bernd


Re: pg_stop_backup does not complete

From
Greg Stark
Date:
On Fri, Feb 26, 2010 at 9:41 AM, Bernd Helmle <mailings@oopsware.de> wrote:
>
>
> --On 24. Februar 2010 16:01:02 -0500 Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> One objection to this is that it's not very clear to the user when
>> pg_stop_backup has finished with actual work and is just waiting for the
>> archiver, ie when is it safe to hit control-C?  Maybe we should emit a
>> "backup done, waiting for archiver to complete" notice before entering
>> the sleep loop.
>
> +1 for this. This hint would certainly help to recognize the issue
> immediately (or at least point to a possible cause).

So looking at the code we *do* print something in pg_stop_backup(). We
just wait 60s before doing so. I propose we shorten that to 10s.
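
As a rough illustration of what that change would mean for the user, here is a minimal sketch of the warning schedule. It is not taken from the patch: it assumes the wait loop sleeps one second per iteration and emits a warning whenever the elapsed count reaches seconds_before_warning, which is then doubled (the doubling is visible in the quoted diff; the loop condition is an assumption). The function name warning_times and its parameters are hypothetical.

```python
def warning_times(first_warning, horizon):
    """Return the elapsed-second marks at which a warning would be printed.

    Assumes a loop that sleeps one second per iteration and doubles the
    warning threshold each time a warning is emitted.
    """
    times = []
    threshold = first_warning
    waits = 0
    while waits < horizon:
        waits += 1  # stands in for one pg_usleep(1000000L) iteration
        if waits >= threshold:
            threshold *= 2
            times.append(waits)
    return times


if __name__ == "__main__":
    print(warning_times(60, 300))  # current starting value
    print(warning_times(10, 300))  # proposed starting value
```

Under these assumptions, a five-minute wait yields warnings at 60, 120 and 240 seconds with the current value, and at 10, 20, 40, 80 and 160 seconds with the proposed one.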

Secondarily, the message printed at this time and when the process is
finished doesn't actually give the user any information on how much
longer to expect the process to take.

It would be nice to say what the target archive log we're waiting on
is and then periodically print out what the last archived log file
was. Or perhaps just do the arithmetic and periodically print how many
megabytes of log files remain to be archived.
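
The arithmetic could look something like the sketch below. It assumes 24-hex-digit WAL file names (timeline, log id, segment), 16MB segments, and 255 usable segments per log id; the helper names and the segs_per_log parameter are hypothetical, and nothing here is existing server code.

```python
SEGMENT_BYTES = 16 * 1024 * 1024  # assumed default WAL segment size


def parse_wal_name(name):
    """Split a WAL file name into (timeline, log id, segment) integers."""
    if len(name) != 24:
        raise ValueError("expected a 24-character WAL file name: %r" % name)
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)


def segments_remaining(last_archived, target, segs_per_log=255):
    """Count segments after last_archived up to and including target.

    segs_per_log=255 is an assumption about the naming scheme, in which
    the last segment of each log id is skipped.
    """
    tli_a, log_a, seg_a = parse_wal_name(last_archived)
    tli_t, log_t, seg_t = parse_wal_name(target)
    if tli_a != tli_t:
        raise ValueError("cannot compare segments on different timelines")
    pos_a = log_a * segs_per_log + seg_a
    pos_t = log_t * segs_per_log + seg_t
    return max(0, pos_t - pos_a)


def megabytes_remaining(last_archived, target):
    """Convert the remaining segment count into megabytes."""
    count = segments_remaining(last_archived, target)
    return count * SEGMENT_BYTES // (1024 * 1024)
```

For example, with 000000010000000000000003 as the last archived file and 000000010000000000000007 as the target, four segments (64 MB) would remain.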


--
greg


Re: pg_stop_backup does not complete

From
Fujii Masao
Date:
On Fri, Feb 26, 2010 at 2:47 AM, Bruce Momjian <bruce@momjian.us> wrote:
> Postgres 9.0 will be the first release to mention /bin/true as a way of
> turning off archiving in extraordinary circumstances:
>
>        http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html


> Setting archive_mode to a command that does nothing but return true, e.g. /bin/true,

"return true" seems ambiguous for me. How about writing clearly
"return a zero exit status" instead?
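
To make the distinction concrete, here is a small sketch of what "zero exit status" means to a caller. It does not use /bin/true itself; the two no-op commands are stand-ins built from the running Python interpreter, and the function name command_reports_success is hypothetical.

```python
import subprocess
import sys


def command_reports_success(argv):
    """Run a command and treat a zero exit status as success.

    This mirrors the convention discussed above: only the exit status
    matters, not anything the command prints or "returns".
    """
    return subprocess.run(argv).returncode == 0


# Stand-ins for a do-nothing command that succeeds and one that fails.
NOOP_OK = [sys.executable, "-c", "raise SystemExit(0)"]
NOOP_FAIL = [sys.executable, "-c", "raise SystemExit(1)"]

if __name__ == "__main__":
    print(command_reports_success(NOOP_OK))
    print(command_reports_success(NOOP_FAIL))
```

Any command that does nothing and exits with status zero would be treated the same way as NOOP_OK here, which is why wording based on the exit status is less ambiguous than "return true".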

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: pg_stop_backup does not complete

From
Fujii Masao
Date:
On Fri, Feb 26, 2010 at 10:00 PM, Greg Stark <gsstark@mit.edu> wrote:
> Secondarily, the message printed at this time and when the process is
> finished doesn't actually give the user any information on how much
> longer to expect the process to take.
>
> It would be nice to say what the target archive log we're waiting on
> is and then periodically print out what the last archived log file
> was.

+1

We would be easily able to calculate the last archived log file from
the existence of archive status files.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: pg_stop_backup does not complete

From
Greg Smith
Date:
Fujii Masao wrote:
> On Fri, Feb 26, 2010 at 2:47 AM, Bruce Momjian <bruce@momjian.us> wrote:
>> Postgres 9.0 will be the first release to mention /bin/true as a way of
>> turning off archiving in extraordinary circumstances:
>>
>>        http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html
>
>> Setting archive_mode to a command that does nothing but return true, e.g. /bin/true,
>
> "return true" seems ambiguous for me. How about writing clearly
> "return a zero exit status" instead?

This is a good catch, and I have a work-in-progress update to that doc
section that fixes that wording, as well as rearranging the recent
additions a bit.  Really, that whole "/bin/true" bit needs to go after the
example.  A very brief intro to what "exit status" means on various
platforms might be in order too.  I'm adjusting all that to read better;
once I'm happy with it I'll submit a doc patch in the next week or two
with the final result.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us

Re: pg_stop_backup does not complete

From
Greg Smith
Date:
Fujii Masao wrote:
> We would be easily able to calculate the last archived log file from
> the existence of archive status files.
>   

Right, but you have to actually scan the whole archive directory to 
figure that out, and I'd rather not see that code get duplicated 
somewhere else when it's already inside the archive_command logic.  If 
it just shared that info with the rest of the system instead this would 
be trivial to discover.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us



Re: pg_stop_backup does not complete

From
Heikki Linnakangas
Date:
Greg Smith wrote:
> Fujii Masao wrote:
>> We would be easily able to calculate the last archived log file from
>> the existence of archive status files.
> 
> Right, but you have to actually scan the whole archive directory to
> figure that out, and I'd rather not see that code get duplicated
> somewhere else when it's already inside the archive_command logic.  If
> it just shared that info with the rest of the system instead this would
> be trivial to discover.

The archiver process is not connected to shared memory, so scanning the
directory is the way to do it.
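
For reference, the directory scan might look roughly like this. It is only a sketch: it assumes the status directory holds one "<segment>.ready" or "<segment>.done" file per WAL segment, that the caller passes the path to that directory (for example pg_xlog/archive_status under the data directory), and that all segments are on one timeline so that names sort in WAL order. All function names are hypothetical.

```python
import os
import re

# Plain WAL segments only; backup history status files are skipped.
WAL_STATUS_RE = re.compile(r"^([0-9A-F]{24})\.(ready|done)$")


def scan_archive_status(status_dir):
    """Return (ready, done) lists of segment names, each sorted."""
    ready, done = [], []
    for entry in os.listdir(status_dir):
        match = WAL_STATUS_RE.match(entry)
        if match is None:
            continue
        if match.group(2) == "done":
            done.append(match.group(1))
        else:
            ready.append(match.group(1))
    return sorted(ready), sorted(done)


def last_archived_segment(status_dir):
    """Newest segment marked .done, or None if there is none."""
    _, done = scan_archive_status(status_dir)
    return done[-1] if done else None


def pending_segments(status_dir):
    """Segments still marked .ready, i.e. waiting for the archiver."""
    ready, _ = scan_archive_status(status_dir)
    return ready
```

Note that a result of None does not prove nothing was ever archived, since status files for old segments may already have been cleaned up.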

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: pg_stop_backup does not complete

From
"Joshua D. Drake"
Date:
On Tue, 2 Mar 2010 15:20:36 +0900, Fujii Masao <masao.fujii@gmail.com>
wrote:

>> Setting archive_mode to a command that does nothing but return true,
>> e.g. /bin/true,
>
> "return true" seems ambiguous for me. How about writing clearly
> "return a zero exit status" instead?

For the record, I hate the fact that I ever mentioned this, and I think it
is a terrible hack that we would mention it in the docs.
From a professional perspective, I cringe at the idea of telling a
customer to do this, not to mention it won't work on w32.
Joshua D. Drake
--
PostgreSQL - XMPP: jdrake(at)jabber(dot)postgresql(dot)org
   Consulting, Development, Support, Training
   503-667-4564 - http://www.commandprompt.com/
   The PostgreSQL Company, serving since 1997


Re: pg_stop_backup does not complete

From
Simon Riggs
Date:
On Tue, 2010-03-02 at 15:20 +0900, Fujii Masao wrote:
> On Fri, Feb 26, 2010 at 2:47 AM, Bruce Momjian <bruce@momjian.us> wrote:
> > Postgres 9.0 will be the first release to mention /bin/true as a way of
> > turning off archiving in extraordinary circumstances:
> >
> >        http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html
> 
> 
> > Setting archive_mode to a command that does nothing but return true, e.g. /bin/true,
> 
> "return true" seems ambiguous for me. How about writing clearly
> "return a zero exit status" instead?

Docs are already quite clear on that point. I think we should avoid
specifying it twice.

-- Simon Riggs           www.2ndQuadrant.com



Re: pg_stop_backup does not complete

From
Greg Stark
Date:
On Tue, Mar 2, 2010 at 9:48 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> > Setting archive_mode to a command that does nothing but return true, e.g. /bin/true,
>>
>> "return true" seems ambiguous for me. How about writing clearly
>> "return a zero exit status" instead?
>
> Docs are already quite clear on that point. I think we should avoid
> specifying it twice.
>

Why do we disallow turning off archive_mode anyways? I understand not
turning it on -- though even that would be nice if it "took effect
after the next checkpoint" but turning it off should always be safe,
no?



-- 
greg


Re: pg_stop_backup does not complete

From
Simon Riggs
Date:
On Tue, 2010-03-02 at 13:13 +0000, Greg Stark wrote:
> On Tue, Mar 2, 2010 at 9:48 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >> > Setting archive_mode to a command that does nothing but return true, e.g. /bin/true,
> >>
> >> "return true" seems ambiguous for me. How about writing clearly
> >> "return a zero exit status" instead?
> >
> > Docs are already quite clear on that point. I think we should avoid
> > specifying it twice.
> >
> 
> Why do we disallow turning off archive_mode anyways? 

Because it is needed for safety and nobody has got around to coding the
idea of turning it on/off during normal running, which is possible, with
appropriate care.

> I understand not
> turning it on -- though even that would be nice if it "took effect
> after the next checkpoint" but turning it off should always be safe,
> no?

We don't support that behaviour in parameters.

-- Simon Riggs           www.2ndQuadrant.com



Re: pg_stop_backup does not complete

From
Greg Smith
Date:
Simon Riggs wrote:
> On Tue, 2010-03-02 at 13:13 +0000, Greg Stark wrote:
>   
>> Why do we disallow turning off archive_mode anyways? 
>>     
>
> Because it is needed for safety and nobody has got around to coding the
> idea of turning it on/off during normal running, which is possible, with
> appropriate care.
>   

It's actually made it pretty high up on the list of desired features for 
some of the replication projects:  
http://wiki.postgresql.org/wiki/ClusterFeatures#Start.2Fstop_archiving_at_runtime

Since that is one of the easier items on that list to actually knock off 
(probably an order of magnitude more so than the average feature there), it's 
completely feasible somebody will do so for 9.1.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us