Thread: Improve shutdown during online backup, take 2

Improve shutdown during online backup, take 2

From
"Albe Laurenz"
Date:
This patch replaces the previous version
http://archives.postgresql.org/pgsql-patches/2008-04/msg00004.php

As suggested by Simon Riggs in
http://archives.postgresql.org/pgsql-patches/2008-04/msg00013.php ,
a smart shutdown will now wait for online backup mode to finish.
The functions that touch "backup_label" have been moved to xlog.c.

As suggested by Heikki Linnakangas in
http://archives.postgresql.org/pgsql-patches/2008-04/msg00180.php ,
I have introduced a new postmaster state PM_WAIT_BACKUP.
In this state the postmaster will still accept connections.

Thanks for the feedback!
I'll try to add a link at the Wiki page.

Yours,
Laurenz Albe


Attachment

Re: Improve shutdown during online backup, take 2

From
Simon Riggs
Date:
On Wed, 2008-04-09 at 14:58 +0200, Albe Laurenz wrote:
> This patch replaces the previous version
> http://archives.postgresql.org/pgsql-patches/2008-04/msg00004.php
>
> As suggested by Simon Riggs in
> http://archives.postgresql.org/pgsql-patches/2008-04/msg00013.php ,
> a smart shutdown will now wait for online backup mode to finish.
> The functions that touch "backup_label" have been moved to xlog.c.
>
> As suggested by Heikki Linnakangas in
> http://archives.postgresql.org/pgsql-patches/2008-04/msg00180.php ,
> I have introduced a new postmaster state PM_WAIT_BACKUP.
> In this state the postmaster will still accept connections.
>
> Thanks for the feedback!
> I'll try to add a link at the Wiki page.

Patch applies, and works as described. Looks good for final apply.

Few minor thoughts:

* Text in pg_ctl should be WARNING, not Warning.
* CancelBackup() API looks strange, not sure why
* Need to mention that CancelBackup() is not the right way to end a
backup, so that function and pg_stop_backup should reference each other

Other than those, I like it. Very useful patch.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Improve shutdown during online backup, take 3

From
"Albe Laurenz"
Date:
Simon Riggs wrote:
> Patch applies, and works as described. Looks good for final apply.
>
> Few minor thoughts:
>
> * Text in pg_ctl should be WARNING, not Warning.
> * CancelBackup() API looks strange, not sure why
> * Need to mention that CancelBackup() is not the right way to end a
> backup, so that function and pg_stop_backup should reference
> each other
>
> Other than those, I like it. Very useful patch.

Thanks for the feedback!

- I have replaced "Warning" with WARNING".
- I have changed the API of CancelBackup() to return void.
  I don't use the return code anyway, and I guess it's less confusing
  and "strange" that way.
- I have added comments to disambiguate pg_stop_backup() and CancelBackup().

CancelBackup now writes a message to the server log if it cannot delete
backup_label - I hope that's not too verbose...

Yours,
Laurenz Albe

Attachment

Re: Improve shutdown during online backup, take 3

From
Magnus Hagander
Date:
Albe Laurenz wrote:
> Simon Riggs wrote:
> > Patch applies, and works as described. Looks good for final apply.
> >
> > Few minor thoughts:
> >
> > * Text in pg_ctl should be WARNING, not Warning.
> > * CancelBackup() API looks strange, not sure why
> > * Need to mention that CancelBackup() is not the right way to end a
> > backup, so that function and pg_stop_backup should reference
> > each other
> >
> > Other than those, I like it. Very useful patch.
>
> Thanks for the feedback!
>
> - I have replaced "Warning" with WARNING".
> - I have changed the API of CancelBackup() to return void.
>   I don't use the return code anyway, and I guess it's less confusing
>   and "strange" that way.
> - I have added comments to disambiguate pg_stop_backup() and
> CancelBackup().
>
> CancelBackup now writes a message to the server log if it cannot
> delete backup_label - I hope that's not too verbose...

This doesn't look like our normal coding standards, and should probably
be changed:
+     if (0 != stat(BACKUP_LABEL_FILE, &stat_buf))

(there's a number of similar places)

//Magnus

Re: Improve shutdown during online backup, take 3

From
"Albe Laurenz"
Date:
Magnus Hagander wrote:
> This doesn't look like our normal coding standards, and should probably
> be changed:
> +     if (0 != stat(BACKUP_LABEL_FILE, &stat_buf))
>
> (there's a number of similar places)

Lacking guidelines, I now tried to copy how stat(2) is used in other parts
of the code.

Yours,
Laurenz Albe


Attachment

Re: Improve shutdown during online backup, take 4

From
Magnus Hagander
Date:
Albe Laurenz wrote:
> Magnus Hagander wrote:
> > This doesn't look like our normal coding standards, and should
> > probably be changed:
> > +     if (0 != stat(BACKUP_LABEL_FILE, &stat_buf))
> >
> > (there's a number of similar places)
>
> I see. Lacking guidelines, I now copied how stat(2) is used in
> other parts of the code.

Really? I thought we didn't do that :), and I recall having my own
patches bounced for the same reason.

Just to be sure, we are talking about the
if (0 == foo)
vs
if (foo == 0)

thing, right? If not, then I wasn't clear enough in my note :-(

//Magnus

Re: Improve shutdown during online backup, take 4

From
Magnus Hagander
Date:
Albe Laurenz wrote:
> Magnus Hagander wrote:
> > This doesn't look like our normal coding standards, and should
> > probably be changed:
> > +     if (0 != stat(BACKUP_LABEL_FILE, &stat_buf))
> >
> > (there's a number of similar places)
>
> I see. Lacking guidelines, I now copied how stat(2) is used in
> other parts of the code.

Applied with some minor changes to the error messages to make them
closer to our guidelines. (With my track record, it's not entirely
unlikely that someone will fix them further though :-P)

//Magnus

Re: Improve shutdown during online backup, take 4

From
Alvaro Herrera
Date:
Magnus Hagander wrote:

> Applied with some minor changes to the error messages to make them
> closer to our guidelines. (With my track record, it's not entirely
> unlikely that someone will fix them further though :-P)

I think the messages should not have a newline in the middle.

Also, I am wondering if in PM_WAIT_BACKUP mode we should accept new
connections from superusers only.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Improve shutdown during online backup, take 4

From
Magnus Hagander
Date:
Alvaro Herrera wrote:
> Magnus Hagander wrote:
>
> > Applied with some minor changes to the error messages to make them
> > closer to our guidelines. (With my track record, it's not entirely
> > unlikely that someone will fix them further though :-P)
>
> I think the messages should not have a newline in the middle.

Are you talking about the one in pg_ctl? We have other messages in
pg_ctl that already do this, so I figured it was ok there...


> Also, I am wondering if in PM_WAIT_BACKUP mode we should accept new
> connections from superusers only.

Oh yeah, bleh, that reminds me that I should send off the couple of
comments for further improvements I thought of.

One was just this.

Another was, should we expose the cancel backup as a SQL level command?

//Magnus

Re: Improve shutdown during online backup, take 4

From
"Albe Laurenz"
Date:
Alvaro Herrera wrote:
> I think the messages should not have a newline in the middle.
>
> Also, I am wondering if in PM_WAIT_BACKUP mode we should accept new
> connections from superusers only.

I spent some thought on that.
You'd need to wait until the user is authenticated before you can determine
if he/she is a superuser and may connect (otherwise I think it would be a
security leak that enables any attacker to find out whether a given user is
a superuser without knowing the password).

By that time the server process is already forked.
I couldn't see a way to check the postmaster state at that point,
so I decided not to try and keep it simple.

If you have any ideas how I could do such a check reasonably,
I'd be happy to try it, because basically I think it would be the
right thing.

Yours,
Laurenz Albe

Re: Improve shutdown during online backup, take 4

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> Alvaro Herrera wrote:
>> I think the messages should not have a newline in the middle.

> Are you talking about the one in pg_ctl? We have other messages in
> pg_ctl that already do this, so I figured it was ok there...

I concur that the messages added to pg_ctl are bizarrely formatted.
Why would you put a newline in the middle of a sentence, when you
could equally well emit something like

WARNING: online backup mode is active.
Shutdown will not complete until pg_stop_backup() is called.

While we're on the subject, the messages added to xlog.c do not
follow the style guidelines: in particular, errdetail should be
a complete sentence, and the WARNING is trying to stuff independent
thoughts into one message.  I'd probably do

errmsg("online backup mode cancelled"),
errdetail("\"%s\" was renamed to \"%s\".", ...

errmsg("online backup mode was not cancelled"),
errdetail("Failed to rename \"%s\" to \"%s\": %m", ...

Lastly, the changes to pmdie's SIGINT handling seem quite bogus.
Don't you need to transition into WAIT_BACKUP rather than WAIT_BACKENDS
state in that case too?  Shouldn't you do CancelBackup *before*
PostmasterStateMachine?  The thing screams of race conditions.

            regards, tom lane

Re: Improve shutdown during online backup, take 4

From
"Albe Laurenz"
Date:
Tom Lane wrote:
> I concur that the messages added to pg_ctl are bizarrely formatted.
> Why would you put a newline in the middle of a sentence, when you
> could equally well emit something like
>
> WARNING: online backup mode is active.
> Shutdown will not complete until pg_stop_backup() is called.
>
> While we're on the subject, the messages added to xlog.c do not
> follow the style guidelines: in particular, errdetail should be
> a complete sentence, and the WARNING is trying to stuff independent
> thoughts into one message.  I'd probably do
>
> errmsg("online backup mode cancelled"),
> errdetail("\"%s\" was renamed to \"%s\".", ...
>
> errmsg("online backup mode was not cancelled"),
> errdetail("Failed to rename \"%s\" to \"%s\": %m", ...

Attached is a patch that changes the messages along these lines.
Thanks!

> Lastly, the changes to pmdie's SIGINT handling seem quite bogus.
> Don't you need to transition into WAIT_BACKUP rather than WAIT_BACKENDS
> state in that case too?  Shouldn't you do CancelBackup *before*
> PostmasterStateMachine?  The thing screams of race conditions.

I suspect there must be a misunderstanding.
You cannot really mean that the postmaster should enter WAIT_BACKUP
state on a fast shutdown request.

Sure, CancelBackup could be called earlier. It doesn't do much more than
rename a file.
My reason for calling it late was that backup mode should really only be
cancelled if we manage to shutdown cleanly, and this is not clear until
the last child is gone.

I cannot see a race condition, except maybe the quite unlikely case
that two instances of CancelBackup might run concurrently, and it
*might* happen that one of them finds that "backup_label" is present but
fails to remove it, because the other one already did.
That would lead to a bogus "backup mode not canceled" message.

Are you referring to that or is there something more fundamental
I fail to see?

Yours,
Laurenz Albe

Attachment

Re: Improve shutdown during online backup, take 4

From
Magnus Hagander
Date:
Albe Laurenz wrote:
> Tom Lane wrote:
> > I concur that the messages added to pg_ctl are bizarrely formatted.
> > Why would you put a newline in the middle of a sentence, when you
> > could equally well emit something like
> >
> > WARNING: online backup mode is active.
> > Shutdown will not complete until pg_stop_backup() is called.
> >
> > While we're on the subject, the messages added to xlog.c do not
> > follow the style guidelines: in particular, errdetail should be
> > a complete sentence, and the WARNING is trying to stuff independent
> > thoughts into one message.  I'd probably do
> >
> > errmsg("online backup mode cancelled"),
> > errdetail("\"%s\" was renamed to \"%s\".", ...
> >
> > errmsg("online backup mode was not cancelled"),
> > errdetail("Failed to rename \"%s\" to \"%s\": %m", ...
>
> Attached is a patch that changes the messages along these lines.
> Thanks!

Hmm. I've preivously been told not to use "Failed to" but instead use
"Could not"... Didn't notice that Tom used the other one in his
suggestion.

Tom (or someone else) - can you comment on if I misunderstood that
recommendation earlier, or if it still holds? I'll hold back a commit
until someone has commented on it :-)

Also, from this patch, you removed the %m part - I have re-added that
before commit.

(I will not for now comment on the rest of the mail, I'll leave that to
Tom)

//Magnus

Re: Improve shutdown during online backup, take 4

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> Hmm. I've preivously been told not to use "Failed to" but instead use
> "Could not"... Didn't notice that Tom used the other one in his
> suggestion.

> Tom (or someone else) - can you comment on if I misunderstood that
> recommendation earlier, or if it still holds?

Oh, yes, "could not" is better.  My mistake.

            regards, tom lane

Re: Improve shutdown during online backup, take 4

From
Magnus Hagander
Date:
Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
> > Hmm. I've preivously been told not to use "Failed to" but instead
> > use "Could not"... Didn't notice that Tom used the other one in his
> > suggestion.
>
> > Tom (or someone else) - can you comment on if I misunderstood that
> > recommendation earlier, or if it still holds?
>
> Oh, yes, "could not" is better.  My mistake.

Ok, thanks. Committed new error messages as such then.

//Magnus

Re: Improve shutdown during online backup, take 4

From
Tom Lane
Date:
"Albe Laurenz" <laurenz.albe@wien.gv.at> writes:
> Tom Lane wrote:
>> Lastly, the changes to pmdie's SIGINT handling seem quite bogus.
>> Don't you need to transition into WAIT_BACKUP rather than WAIT_BACKENDS
>> state in that case too?  Shouldn't you do CancelBackup *before*
>> PostmasterStateMachine?  The thing screams of race conditions.

> I suspect there must be a misunderstanding.
> You cannot really mean that the postmaster should enter WAIT_BACKUP
> state on a fast shutdown request.

Why not?  It'll fall out of the state again immediately in
PostmasterStateMachine, no, if we do a CancelBackup here?  I don't think
these two paths of control should be any more different than really
necessary.

> Sure, CancelBackup could be called earlier. It doesn't do much more than
> rename a file.
> My reason for calling it late was that backup mode should really only be
> cancelled if we manage to shutdown cleanly, and this is not clear until
> the last child is gone.

Well, if there were anything conditional about calling it, then maybe
that argument would hold some water, but the way you've got it here it
*will* get called anyway, just after the PostmasterStateMachine call
... except in the corner case where there were no child processes left
and so PostmasterStateMachine manages to decide it's ready to exit().
So far from implementing the logic you suggest, this arrangement never
gets it right, and has a race condition case in which it gets it exactly
backward.

The other reason for the remark about race conditions is that the
PostmasterStateMachine call should absolutely be the last thing that
pmdie() does --- putting anything after it is wrong, especially things
that might alter the PM state, as indeed CancelBackup could.  The reason
for having that in the signal handler is to cover the possibility that
no such call will occur immediately when we return to the wait loop.
In general all of the condition handlers in postmaster.c should be of
the form "respond to the immediate condition, and then let
PostmasterStateMachine decide if there's anything else to do".

If you actually want the behavior you propose, then the only correct way
to implement it is to embed it into the state machine logic, ie, do the
CancelBackup inside PostmasterStateMachine in some state transition
taken after the last child is gone.

            regards, tom lane

Re: Improve shutdown during online backup, take 4

From
"Albe Laurenz"
Date:
Tom Lane wrote:
>>> Lastly, the changes to pmdie's SIGINT handling seem quite bogus.
>>> Don't you need to transition into WAIT_BACKUP rather than WAIT_BACKENDS
>>> state in that case too?  Shouldn't you do CancelBackup *before*
>>> PostmasterStateMachine?  The thing screams of race conditions.
>
>> I suspect there must be a misunderstanding.
>> You cannot really mean that the postmaster should enter WAIT_BACKUP
>> state on a fast shutdown request.
>
> Why not?  It'll fall out of the state again immediately in
> PostmasterStateMachine, no, if we do a CancelBackup here?  I don't think
> these two paths of control should be any more different than really
> necessary.

We cannot call CancelBackup there because that's exactly the state
in which a smart shutdown waits for a superuser to issue pg_stop_backup().

> Well, if there were anything conditional about calling it, then maybe
> that argument would hold some water, but the way you've got it here it
> *will* get called anyway, just after the PostmasterStateMachine call
[...]

I see.

> The other reason for the remark about race conditions is that the
> PostmasterStateMachine call should absolutely be the last thing that
> pmdie() does --- putting anything after it is wrong, especially things
> that might alter the PM state, as indeed CancelBackup could.

I see that too. Thanks for explaining.

> If you actually want the behavior you propose, then the only correct way
> to implement it is to embed it into the state machine logic, ie, do the
> CancelBackup inside PostmasterStateMachine in some state transition
> taken after the last child is gone.

I've attached a patch that works for me. I hope I got it right.

Yours,
Laurenz Albe

Attachment

Re: Improve shutdown during online backup, take 4

From
Tom Lane
Date:
"Albe Laurenz" <laurenz.albe@wien.gv.at> writes:
> Tom Lane wrote:
>> Why not?  It'll fall out of the state again immediately in
>> PostmasterStateMachine, no, if we do a CancelBackup here?

> We cannot call CancelBackup there because that's exactly the state
> in which a smart shutdown waits for a superuser to issue pg_stop_backup().

Er, I was complaining about the fast-shutdown code path, not the
smart-shutdown one.

            regards, tom lane

Re: Improve shutdown during online backup, take 4

From
"Albe Laurenz"
Date:
Tom Lane wrote:
>>> Why not?  It'll fall out of the state again immediately in
>>> PostmasterStateMachine, no, if we do a CancelBackup here?
>>
>> We cannot call CancelBackup there because that's exactly the state
>> in which a smart shutdown waits for a superuser to issue pg_stop_backup().
>
> Er, I was complaining about the fast-shutdown code path, not the
> smart-shutdown one.

Yes, I got that.
So you suggest that a fast shutdown *first* calls CancelBackup and
then goes into WAIT_BACKUP state, right?

That should work, but isn't it better if backup_label is removed
only if we know we're going to shutdown cleanly?

I think the patch in http://archives.postgresql.org/pgsql-patches/2008-04/msg00458.php
should meet your justified complaints, by calling CancelBackup immediately
before PostmasterExit(0) in PostmasterStateMachine.

Yours,
Laurenz Albe

Re: Improve shutdown during online backup, take 4

From
Tom Lane
Date:
"Albe Laurenz" <laurenz.albe@wien.gv.at> writes:
> That should work, but isn't it better if backup_label is removed
> only if we know we're going to shutdown cleanly?

Why?  That seems like an entirely arbitrary specification.

            regards, tom lane

Re: Improve shutdown during online backup, take 4

From
"Albe Laurenz"
Date:
Tom Lane wrote:
> Why?  That seems like an entirely arbitrary specification.

My resoning is that I think of smart/fast/immediate shutdown
as three different things.

For an immediate shutdown/crash thought it was best not to modify
anything to facilitate an analysis of the problem.

A fast shutdown that fails will end up as a crash or immediate
shutdown.


If you think that is is not important to only cancel backup mode
if we are sure that the shutdown will be clean, we might as well
also cancel online backup mode during an immediate shutdown.

In that case, I'd agree that the call to CancelBackup() could be moved
to WAIT_BACKUP (and executed only if it is no smart shutdown).
It would have the advantage of having all backup mode related
code in postmaster.c concentrated in one spot.


Make a suggestion, and I'll implement it that way.

Yours,
Laurenz Albe

Re: Improve shutdown during online backup, take 4

From
Tom Lane
Date:
"Albe Laurenz" <laurenz.albe@wien.gv.at> writes:
> Tom Lane wrote:
>> If you actually want the behavior you propose, then the only correct way
>> to implement it is to embed it into the state machine logic, ie, do the
>> CancelBackup inside PostmasterStateMachine in some state transition
>> taken after the last child is gone.

> I've attached a patch that works for me. I hope I got it right.

Applied with additional cleanup.  You hadn't thought very carefully
about additional state transitions that would have to be introduced
into the postmaster state machine to support a new state --- for
example, as coded a SIGINT delivered to the postmaster after SIGTERM
would fail to do anything at all, when of course it really ought to
force us into fast shutdown.  Also, it's not really that hard to
disallow non-superusers from connecting in PM_WAIT_BACKUP state.

            regards, tom lane

Re: Improve shutdown during online backup, take 4

From
"Albe Laurenz"
Date:
Tom Lane wrote:
> > I've attached a patch that works for me. I hope I got it right.
>
> Applied with additional cleanup.  You hadn't thought very carefully
> about additional state transitions that would have to be introduced
> into the postmaster state machine to support a new state --- for
> example, as coded a SIGINT delivered to the postmaster after SIGTERM
> would fail to do anything at all, when of course it really ought to
> force us into fast shutdown.  Also, it's not really that hard to
> disallow non-superusers from connecting in PM_WAIT_BACKUP state.

Thank you for helping.
You know, this is my first patch for server code. I know that I
still have a lot to learn until I grok how all that works
together.

Yours,
Laurenz Albe