Thread: Archiver behavior at shutdown

Archiver behavior at shutdown

From
Tom Lane
Date:
The problem complained of in bug #3843 was something I'd noticed a few
days ago and meant to fix.  ISTM the recent change to have the archiver
outlive the postmaster was incompletely thought out, and we really need
to take two steps back and reconsider, if we want to fix it so it works.
As of CVS HEAD, the behavior after the postmaster receives a shutdown
request and has seen its last regular-backend child die is:

1. Issue SIGUSR2 to the bgwriter to make it start a shutdown checkpoint.

2. Immediately SIGQUIT the archiver.

3. Back at the main loop, restart the archiver, if it exits before the
bgwriter finishes the checkpoint (as is highly likely).

4. After postmaster exits, archiver eventually notices it's gone,
but that takes a good while since we are guaranteed to be just
starting the delay loop inside the fresh archiver process.

This is just plain dumb.  Aside from the uselessness of killing a
process only to immediately re-fork it, we should not be SIGQUIT'ing
the archiver during normal operation --- that might abort an archive
copy partway through, and it's anybody's guess whether the
archive_command script is smart enough to deal with that situation.

ISTM the postmaster should leave the archiver alone at the
PM_WAIT_BACKENDS -> PM_SHUTDOWN transition, and instead send it
a WAKEN signal (SIGUSR1) when it sees normal exit of the bgwriter.
That will afford an opportunity to archive anything that was pushed
out during the shutdown checkpoint.  A possibly better alternative,
since the archiver isn't using SIGUSR2, is to send SIGUSR2 which
would be defined as "archive what you can and then quit".  (In that
case, the !PostmasterIsAlive exit would be taken only in the event
of a true postmaster crash, which is improbable.)
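
A minimal sketch of the "archive what you can and then quit" idea,
assuming a simplified archiver loop. The flag names, the function names
and the integer "pending" counter are hypothetical stand-ins, not
PostgreSQL's actual code:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical flags; these are not PostgreSQL's actual variable names. */
static volatile sig_atomic_t got_wakeup = 0;     /* SIGUSR1: segments may be ready */
static volatile sig_atomic_t got_final_pass = 0; /* SIGUSR2: archive, then quit */

static void on_sigusr1(int signo) { (void) signo; got_wakeup = 1; }
static void on_sigusr2(int signo) { (void) signo; got_final_pass = 1; }

/* Install both handlers; returns 0 on success, -1 on failure. */
int archiver_install_handlers(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sigemptyset(&sa.sa_mask);

    sa.sa_handler = on_sigusr1;
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    sa.sa_handler = on_sigusr2;
    if (sigaction(SIGUSR2, &sa, NULL) != 0)
        return -1;

    return 0;
}

/*
 * One cycle of a simplified archiver loop.  *pending models the number of
 * segments ready to archive; each loop iteration stands in for one
 * archive_command call.  Returns true if the archiver should exit.
 */
bool archiver_cycle(int *pending, int *copied)
{
    bool quit_after;

    assert(pending != NULL && copied != NULL);

    /* Sample the flag first: whatever was ready when SIGUSR2 arrived is copied. */
    quit_after = (got_final_pass != 0);
    got_wakeup = 0;

    while (*pending > 0)
    {
        (*pending)--;
        (*copied)++;
    }
    return quit_after;
}
```

Sampling the SIGUSR2 flag before the pass rather than after it means a
signal that arrives mid-pass costs one extra (usually empty) cycle, but
nothing that was ready when the signal was sent gets skipped.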

Another case that seems not to have been thought about very much is
whether the archiver should behave differently in a "mode fast" shutdown
as opposed to "mode smart".  I would argue that it should not, since
both cases are supposed to be equally safe for your data.  I notice
though that the postmaster suppresses forwarding of WAKEN signals
after entering FastShutdown mode; that doesn't seem like a good idea.

Another case that needs some revisiting is the archiver's response
to SIGTERM, which is currently SIG_IGN.  Since the postmaster will never
send it SIGTERM, we should assume that receipt of SIGTERM means that
init is telling us we have N seconds left before system shutdown.
Is it a good idea to continue archiving in that situation?  I doubt it
--- it seems like we are just asking to get SIGKILL'd partway through a
copy step.  I suggest that the response to SIGTERM ought to be to finish
out the current copy operation (if possible) but then quit without
initiating any new ones.
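
The suggested SIGTERM response could look roughly like the sketch
below. It assumes the copy step is a plain function call; in the real
archiver the copy is an external archive_command, so whether it survives
the SIGTERM is a separate question. All names here are hypothetical:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Hypothetical flag set when SIGTERM is received. */
static volatile sig_atomic_t got_sigterm = 0;

static void on_sigterm(int signo) { (void) signo; got_sigterm = 1; }

/* Replace the default (or SIG_IGN) disposition; returns 0 on success. */
int install_sigterm_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = on_sigterm;
    return sigaction(SIGTERM, &sa, NULL);
}

/* Stand-in for one archive_command invocation on segment number segno. */
typedef void (*copy_fn)(int segno);

/*
 * Copy up to nsegments segments.  A copy that has started always runs to
 * completion; the SIGTERM flag is only consulted between copies, so no new
 * copy is initiated once the signal has been seen.  Returns the number of
 * segments copied.
 */
int archive_until_sigterm(int nsegments, copy_fn copy_one)
{
    int done = 0;

    assert(copy_one != NULL);

    while (done < nsegments && !got_sigterm)
    {
        copy_one(done);
        done++;
    }
    return done;
}

/* Demo copy step: pretends SIGTERM arrives while segment 1 is being copied. */
void demo_copy(int segno)
{
    if (segno == 1)
        raise(SIGTERM);
}
```

With five segments queued and demo_copy as the copy step, segments 0 and
1 are copied and the remaining three are left for the next startup.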

And while I'm griping: I see that the pgstats process is SIGQUIT'ed at
the entry to PM_SHUTDOWN state, same as the archiver.  This likewise
seems out of step with current reality, since the bgwriter now sends
messages to the stats collector.  This step needs to be moved to after
bgwriter termination, too.
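
The ordering proposed above can be written down as a tiny state
machine. This is only an illustration: PM_WAIT_BACKENDS and PM_SHUTDOWN
are the states named in this message, while PM_EXITING, the event names
and the action flags are made up for the sketch:

```c
#include <assert.h>
#include <stddef.h>

typedef enum { PM_WAIT_BACKENDS, PM_SHUTDOWN, PM_EXITING } PMState;
typedef enum { EV_LAST_BACKEND_EXITED, EV_BGWRITER_EXITED } PMEvent;

/* Actions the postmaster would take, as a bitmask (hypothetical names). */
enum
{
    ACT_BGWRITER_CHECKPOINT = 1 << 0,   /* SIGUSR2 to the bgwriter */
    ACT_ARCHIVER_FINISH     = 1 << 1,   /* tell archiver: archive, then quit */
    ACT_STATS_STOP          = 1 << 2    /* stop the stats collector */
};

/*
 * Advance the shutdown state for one event and return the actions to take.
 * The archiver and stats collector are left alone until the bgwriter has
 * exited, so they can still see the output of the shutdown checkpoint.
 */
unsigned pm_shutdown_step(PMState *state, PMEvent ev)
{
    assert(state != NULL);

    if (*state == PM_WAIT_BACKENDS && ev == EV_LAST_BACKEND_EXITED)
    {
        *state = PM_SHUTDOWN;
        return ACT_BGWRITER_CHECKPOINT;
    }
    if (*state == PM_SHUTDOWN && ev == EV_BGWRITER_EXITED)
    {
        *state = PM_EXITING;
        return ACT_ARCHIVER_FINISH | ACT_STATS_STOP;
    }
    return 0;   /* no action for any other combination */
}
```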

Comments?  Anyone see any other bugs here?
        regards, tom lane


Re: Archiver behavior at shutdown

From
Simon Riggs
Date:
On Thu, 2007-12-27 at 15:29 -0500, Tom Lane wrote:

> As of CVS HEAD, the behavior after the postmaster receives a shutdown
> request and has seen its last regular-backend child die is:

...based upon limitations of the existing system. We have been
SIGQUIT'ing the archiver, and there is a comment there to say how
important it is that we *do not* try to finish processing before we
quit. If you think that comment is wrong, that's OK by me: I can't recall
the reasoning there, or even if it was my own.

> ISTM the postmaster should leave the archiver alone at the
> PM_WAIT_BACKENDS -> PM_SHUTDOWN transition, and instead send it
> a WAKEN signal (SIGUSR1) when it sees normal exit of the bgwriter.
> That will afford an opportunity to archive anything that was pushed
> out during the shutdown checkpoint.  A possibly better alternative,
> since the archiver isn't using SIGUSR2, is to send SIGUSR2 which
> would be defined as "archive what you can and then quit".  (In that
> case, the !PostmasterIsAlive exit would be taken only in the event
> of a true postmaster crash, which is improbable.)

Sounds good.

> Another case that needs some revisiting is the archiver's response
> to SIGTERM, which is currently SIG_IGN.  Since the postmaster will never
> send it SIGTERM, we should assume that receipt of SIGTERM means that
> init is telling us we have N seconds left before system shutdown.
> Is it a good idea to continue archiving in that situation?  I doubt it
> --- it seems like we are just asking to get SIGKILL'd partway through a
> copy step.  I suggest that the response to SIGTERM ought to be to finish
> out the current copy operation (if possible) but then quit without
> initiating any new ones.

Not sure about that. If there are outstanding files to archive, then it
probably is important to try to archive them. Mostly this won't be the
case, but if this were, for example, a simple switchover between a primary
and a warm standby then it might result in data loss.

If you see problems with archive_commands that don't correctly reset
themselves after an error, then we should document how to handle that,
rather than just *try* to avoid it.

> And while I'm griping: I see that the pgstats process is SIGQUIT'ed at
> the entry to PM_SHUTDOWN state, same as the archiver.  This likewise
> seems out of step with current reality, since the bgwriter now sends
> messages to the stats collector.  This step needs to be moved to after
> bgwriter termination, too.

Sounds good.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Archiver behavior at shutdown

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> ...based upon limitations of the existing system. We have been
> SIGQUIT'ing the archiver, and there is a comment there to say how
> important it is that we *do not* try to finish processing before we
> quit. If you think that comment is wrong, that's OK by me: I can't recall
> the reasoning there, or even if it was my own.

That comment is clearly wrong --- it applies to the SIGTERM situation.

>> Another case that needs some revisiting is the archiver's response
>> to SIGTERM, which is currently SIG_IGN.  Since the postmaster will never
>> send it SIGTERM, we should assume that receipt of SIGTERM means that
>> init is telling us we have N seconds left before system shutdown.
>> Is it a good idea to continue archiving in that situation?  I doubt it
>> --- it seems like we are just asking to get SIGKILL'd partway through a
>> copy step.  I suggest that the response to SIGTERM ought to be to finish
>> out the current copy operation (if possible) but then quit without
>> initiating any new ones.

> Not sure about that. If there are outstanding files to archive, then it
> probably is important to try to archive them. Mostly this won't be the
> case, but if this were, for example, a simple switchover between a primary
> and a warm standby then it might result in data loss.

A simple switchover ought to be done by bringing down the postmaster,
not the whole machine.

The real question here is whether it's sane to try to do archiving on a
machine that is in the midst of shutdown.  As an example, it's quite
likely that NFS mounts are going to go away sometime between SIGTERM and
SIGKILL, if they haven't done so already.
        regards, tom lane


Re: Archiver behavior at shutdown

From
Alvaro Herrera
Date:
Tom Lane wrote:

> ISTM the postmaster should leave the archiver alone at the
> PM_WAIT_BACKENDS -> PM_SHUTDOWN transition, and instead send it
> a WAKEN signal (SIGUSR1) when it sees normal exit of the bgwriter.
> That will afford an opportunity to archive anything that was pushed
> out during the shutdown checkpoint.

What does postmaster do then?  Sleep until archiver is done, or exit
immediately and hope that the archiver goes away as soon as it finishes?
If the former, then we open the possibility that postmaster lives far
too long before system shutdown decides to SIGKILL it.  If the latter,
then a subsequent postmaster start could initiate a second archiver
process which would cause issues with whatever the first archiver is
doing.

I think your proposal to handle SIGTERM could also be used whenever
postmaster has been asked for shutdown (except smart shutdown,
perhaps?):

> I suggest that the response to SIGTERM ought to be to finish
> out the current copy operation (if possible) but then quit without
> initiating any new ones.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Archiver behavior at shutdown

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> What does postmaster do then?  Sleep until archiver is done, or exit
> immediately and hope that the archiver goes away as soon as it finishes?

I think it can just exit immediately, particularly if we invent the
variant signal for "archive what you can and then quit".

> If the former, then we open the possibility that postmaster lives far
> too long before system shutdown decides to SIGKILL it.  If the latter,
> then a subsequent postmaster start could initiate a second archiver
> process which would cause issues with whatever the first archiver is
> doing.

That's a problem that the archiver itself should fix (perhaps it needs
its own lockfile).  Consider kill -9 on the postmaster followed by
starting a fresh postmaster --- you have the same problem, and there's
nothing much the postmaster can do about it.
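
One way an archiver-owned lockfile could work is an advisory lock that
the kernel drops when the holder exits. The sketch below uses flock(),
which is a BSD/Linux call rather than POSIX, and is only one possible
approach; the function names and the lockfile path are hypothetical:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Try to take an exclusive, non-blocking lock on the lockfile.
 * Returns an open descriptor that holds the lock, or -1 if another
 * holder exists or any other error occurs.  O_CLOEXEC keeps the
 * descriptor (and hence the lock) from leaking into exec'ed children.
 */
int archiver_try_lock(const char *path)
{
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDWR | O_CREAT | O_CLOEXEC, 0600);
    if (fd < 0)
        return -1;

    if (flock(fd, LOCK_EX | LOCK_NB) != 0)
    {
        close(fd);
        return -1;
    }
    return fd;
}

/* Closing the descriptor releases the lock; so does process exit. */
void archiver_unlock(int fd)
{
    if (fd >= 0)
        close(fd);
}
```

A second archiver that fails to get the lock would simply exit (or
retry later), whether the first one was left behind by kill -9 on the
postmaster or by an ordinary restart. Advisory locks can behave
differently on network filesystems, so this assumes a local data
directory.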

> I think your proposal to handle SIGTERM could also be used whenever
> postmaster has been asked for shutdown (except smart shutdown,
> perhaps?):

>> I suggest that the response to SIGTERM ought to be to finish
>> out the current copy operation (if possible) but then quit without
>> initiating any new ones.

No, because during normal shutdown we'd like the archiver to copy away
*all* available segments, not just one.
        regards, tom lane


Re: Archiver behavior at shutdown

From
Simon Riggs
Date:
On Thu, 2007-12-27 at 17:29 -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > What does postmaster do then?  Sleep until archiver is done, or exit
> > immediately and hope that the archiver goes away as soon as it finishes?
> 
> I think it can just exit immediately, particularly if we invent the
> variant signal for "archive what you can and then quit".
> 
> > If the former, then we open the possibility that postmaster lives far
> > too long before system shutdown decides to SIGKILL it.  If the latter,
> > then a subsequent postmaster start could initiate a second archiver
> > process which would cause issues with whatever the first archiver is
> > doing.
> 
> That's a problem that the archiver itself should fix (perhaps it needs
> its own lockfile). 

http://archives.postgresql.org/pgsql-hackers/2006-05/msg00920.php

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Archiver behavior at shutdown

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Thu, 2007-12-27 at 17:29 -0500, Tom Lane wrote:
>> Alvaro Herrera <alvherre@commandprompt.com> writes:
>>> then a subsequent postmaster start could initiate a second archiver
>>> process which would cause issues with whatever the first archiver is
>>> doing.
>> 
>> That's a problem that the archiver itself should fix (perhaps it needs
>> its own lockfile). 

> http://archives.postgresql.org/pgsql-hackers/2006-05/msg00920.php

I thought that sounded familiar ;-).  What was the outcome of that
discussion?  No patch for this ever got applied AFAICS.  The patch
as posted had a few issues, per the thread, and I don't see a followup
version.  (The alleged replacement patch did something else entirely.)
        regards, tom lane


Re: Archiver behavior at shutdown

From
Simon Riggs
Date:
On Thu, 2007-12-27 at 18:54 -0500, Tom Lane wrote: 
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Thu, 2007-12-27 at 17:29 -0500, Tom Lane wrote:
> >> Alvaro Herrera <alvherre@commandprompt.com> writes:
> >>> then a subsequent postmaster start could initiate a second archiver
> >>> process which would cause issues with whatever the first archiver is
> >>> doing.
> >> 
> >> That's a problem that the archiver itself should fix (perhaps it needs
> >> its own lockfile). 
> 
> > http://archives.postgresql.org/pgsql-hackers/2006-05/msg00920.php
> 
> I thought that sounded familiar ;-).  

As you say, I'm beginning to know where the bodies are buried...

> What was the outcome of that
> discussion?  No patch for this ever got applied AFAICS.  The patch
> as posted had a few issues, per the thread, and I don't see a followup
> version.  (The alleged replacement patch did something else entirely.)

We applied a one line change in preference to the lockfile approach for
8.2, requested by you, agreed to by me and applied by Bruce.


This would be the behaviour I would have, if I had a blank canvas:

- keep archiver alive at shutdown, rather than bouncing it

- send SIGUSR2 to do finish-up and close, just like bgwriter

- put a lockfile in for the archiver that prevents a new archiver from
starting, but everything else comes up OK. In postmaster if PgArchPID ==
0 then we check for archiver.pid, if present, read it and send a SIGUSR2
to it. If rc = ESRCH then the process is not present, so start up a new archiver

- let's keep archiving, if there is work to do, right up until the last
possible moment, even if the postmaster has gone

- ensure people understand that an archive_command call can be
interrupted and may need to handle the consequences if the command is
not atomic
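
The pid-file check in the third item could be sketched as follows. This
version probes with signal 0 (which only tests for existence) instead of
sending SIGUSR2 as described above, and leaves removal of a stale file
to the caller; the file name and function names are hypothetical:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

typedef enum { ARCHIVER_ABSENT, ARCHIVER_RUNNING, ARCHIVER_UNKNOWN } ArchiverProbe;

/* Called by an archiver at startup: record our own pid.  0 on success. */
int write_archiver_pidfile(const char *path)
{
    FILE *fp = fopen(path, "w");

    if (fp == NULL)
        return -1;
    if (fprintf(fp, "%ld\n", (long) getpid()) < 0)
    {
        fclose(fp);
        return -1;
    }
    return (fclose(fp) == 0) ? 0 : -1;
}

/* Called by the postmaster when it has no archiver child (PgArchPID == 0). */
ArchiverProbe probe_archiver_pidfile(const char *path)
{
    FILE *fp;
    long  pid = 0;
    int   nread;

    assert(path != NULL);

    fp = fopen(path, "r");
    if (fp == NULL)
        return (errno == ENOENT) ? ARCHIVER_ABSENT : ARCHIVER_UNKNOWN;

    nread = fscanf(fp, "%ld", &pid);
    fclose(fp);
    if (nread != 1 || pid <= 0)
        return ARCHIVER_UNKNOWN;        /* unreadable or garbage contents */

    if (kill((pid_t) pid, 0) == 0)
        return ARCHIVER_RUNNING;        /* some process has this pid */
    if (errno == ESRCH)
        return ARCHIVER_ABSENT;         /* stale file: caller may remove it */
    return ARCHIVER_UNKNOWN;            /* e.g. EPERM: exists, but not ours */
}
```

Note that ARCHIVER_RUNNING only means that some process currently has
the recorded pid; the probe by itself cannot prove that the process is
an archiver.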

With those changes the use cases would look like this...

System Shutdown
System shuts down, postmaster shuts down, archiver works furiously until
the end trying to archive things away. Archiver gets caught half way
through copy, so crashes, leaving archiver.pid. Subsequent startup sees
archiver.pid, postmaster reads file to get pid, then sends signal to
archiver to see if it is still alive, it isn't so remove archiver.pid
and allow next archiver to start. First call to archive_command handles
partially copied file in archive.

Server Crash
Something takes down server, archiver stays up trying to archive things
away. Crash recovery kicks in and finishes very quickly, new archiver
tries to start up but cannot because first archiver is still working. At
the end of its cycle, first archiver goes away and allows new archiver
to start and continue operating.

Server Restart
Server shuts down, but there is work to do so first archiver stays
around to finish it. Newly started server tries to start archiver but
cannot because of pid file. Reads pid file, sends signal. Archiver is
already shutting down, so continues its cycle and then quits. New
archiver starts up under new postmaster.

...but that's too much change for me to personally stomach at this stage
of 8.3. My main issue is that I don't have the time to be able to do a
retest of start/stop/restart/crash behaviour and catching all the side
cases is fairly hard, and yet also critical at this stage of play. 

For me, the behaviour is close enough now, with the main issue being the
additional wait at the end of pgarch_MainLoop(). It's been there since
8.2, so a simple fix there would be non-invasive and backpatchable also.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Archiver behavior at shutdown

From
Greg Smith
Date:
On Sat, 29 Dec 2007, Simon Riggs wrote:

> System Shutdown
> System shuts down, postmaster shuts down, archiver works furiously until
> the end trying to archive things away. Archiver gets caught half way
> through copy, so crashes, leaving archiver.pid. Subsequent startup sees
> archiver.pid, postmaster reads file to get pid, then sends signal to
> archiver to see if it is still alive, it isn't so remove archiver.pid
> and allow next archiver to start.

Isn't it possible some other process may have started with that pid if the 
database server was down for long enough?  In that case sending a signal 
presuming it's the archive process that used to have that pid might be bad 
form.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Archiver behavior at shutdown

From
Simon Riggs
Date:
On Fri, 2007-12-28 at 20:20 -0500, Greg Smith wrote:
> On Sat, 29 Dec 2007, Simon Riggs wrote:
> 
> > System Shutdown
> > System shuts down, postmaster shuts down, archiver works furiously until
> > the end trying to archive things away. Archiver gets caught half way
> > through copy, so crashes, leaving archiver.pid. Subsequent startup sees
> > archiver.pid, postmaster reads file to get pid, then sends signal to
> > archiver to see if it is still alive, it isn't so remove archiver.pid
> > and allow next archiver to start.
> 
> Isn't it possible some other process may have started with that pid if the 
> database server was down for long enough?  In that case sending a signal 
> presuming it's the archive process that used to have that pid might be bad 
> form.

I think you've emphasised my point that me rushing this in the time I
have available is not going to improve matters for the group.

My original one line change described on bug 3843 seems like the best
solution for 8.3.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Archiver behavior at shutdown

From
Fujii Masao
Date:
Simon Riggs wrote:

> My original one line change described on bug 3843 seems like the best
> solution for 8.3.
> 

+1
Is this change in time for RC1?

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
TEL (03)5860-5115
FAX (03)5463-5490


Re: Archiver behavior at shutdown

From
Simon Riggs
Date:
On Fri, 2008-01-04 at 17:28 +0900, Fujii Masao wrote:
> Simon Riggs wrote:
>
> > My original one line change described on bug 3843 seems like the best
> > solution for 8.3.
> >
>
> +1
> Is this change in time for RC1?

Patch attached.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com

Re: [PATCHES] Archiver behavior at shutdown

From
Simon Riggs
Date:
On Sat, 2008-01-05 at 12:09 +0000, Simon Riggs wrote:
> On Fri, 2008-01-04 at 17:28 +0900, Fujii Masao wrote:
> > Simon Riggs wrote:
> >
> > > My original one line change described on bug 3843 seems like the best
> > > solution for 8.3.
> > >
> >
> > +1
> > Is this change in time for RC1?
>
> Patch attached.

Not sure why this hasn't been applied yet for 8.3.

We have a small problem, a fix and a user voting for the fix.

Can we discuss, briefly?

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: [PATCHES] Archiver behavior at shutdown

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> Not sure why this hasn't been applied yet for 8.3.

Because it doesn't fix the problem ... which is that the postmaster
kills the archiver (and the stats collector too) at what is now the
wrong point in the shutdown sequence.

            regards, tom lane

Re: [PATCHES] Archiver behavior at shutdown

From
Simon Riggs
Date:
On Wed, 2008-01-09 at 10:15 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Not sure why this hasn't been applied yet for 8.3.
>
> Because it doesn't fix the problem ... which is that the postmaster
> kills the archiver (and the stats collector too) at what is now the
> wrong point in the shutdown sequence.

The original bug report states the problem as being that the archiver
stays for a noticeable period after postmaster shutdown. My patch fixes
that very safely.

It doesn't fix your request for redesign, which I accept is still
pending and I've explained why.

I don't see any reason to leave the original problem hanging just
because the fix isn't as wide as we might really like.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com