Thread: Revised patch for fixing archiver shutdown behavior

Revised patch for fixing archiver shutdown behavior

From
Tom Lane
Date:
The attached patch fixes archiver shutdown in what seems to me to be
a sane way.  With the patch, we send SIGQUIT to the archiver only for
panic-stop situations (backend crash or immediate-mode shutdown).
This is important because the postmaster is coded to send SIGQUIT
to the entire process group, meaning we'd also ungracefully terminate
any currently-running archive command, which does not seem like a good
idea for normal exit.  Instead, the shutdown protocol is that *after*
the bgwriter has quit, we send SIGUSR1 (ie, the normal archiver wakeup
signal) to the archiver.  This ensures that it will do a normal
archiving cycle after the last possible creation of WAL entries.
The archiver is also modified to fall out of its wait loop more
quickly when the postmaster has died (this is the same as Simon's
previous one-liner patch), and to not exit until it has completed
a full archive cycle since the postmaster died.  The latter eliminates
a race condition that existed before --- depending on timing, the
CVS-HEAD archiver might or might not do a final archiving cycle.

I also tweaked the postmaster so that the stats collector isn't
told to quit until after the bgwriter finishes; this ensures that
any stats reported from the bgwriter will be collected.

One point needing discussion is that the postmaster is currently
coded not to send SIGUSR1 to the archiver if a fast-mode shutdown
is under way.  I duplicated that in the added SIGUSR1 signal here,
but I wonder whether it is sane or not.  Comments?

            regards, tom lane


Attachment

Re: Revised patch for fixing archiver shutdown behavior

From
Alvaro Herrera
Date:
Tom Lane wrote:
> The attached patch fixes archiver shutdown in what seems to me to be
> a sane way.  With the patch, we send SIGQUIT to the archiver only for
> panic-stop situations (backend crash or immediate-mode shutdown).
> This is important because the postmaster is coded to send SIGQUIT
> to the entire process group, meaning we'd also ungracefully terminate
> any currently-running archive command, which does not seem like a good
> idea for normal exit.  Instead, the shutdown protocol is that *after*
> the bgwriter has quit, we send SIGUSR1 (ie, the normal archiver wakeup
> signal) to the archiver.  This ensures that it will do a normal
> archiving cycle after the last possible creation of WAL entries.

Hmm, so the postmaster is gone during the last archiving cycle?  What
about syslogger?  Is the archiver able to log stuff in the last cycle?

The comment in line 2180 seems a bit bogus ...?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Revised patch for fixing archiver shutdown behavior

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Hmm, so the postmaster is gone during the last archiving cycle?  What
> about syslogger?  Is the archiver able to log stuff in the last cycle?

The logger is no problem --- it quits when it sees EOF on its input
pipe, which means that all upstream processes are gone.

> The comment in line 2180 seems a bit bogus ...?

Yeah, that could use a bit more work I guess, since "normal children"
sounds like it would refer to more than just backends.

            regards, tom lane

Re: Revised patch for fixing archiver shutdown behavior

From
Tom Lane
Date:
I wrote:
> One point needing discussion is that the postmaster is currently
> coded not to send SIGUSR1 to the archiver if a fast-mode shutdown
> is under way.  I duplicated that in the added SIGUSR1 signal here,
> but I wonder whether it is sane or not.  Comments?

After chewing on that for awhile, I decided it was bogus.  If we are
going to have a policy that the archiver gets a chance to archive
everything, that shouldn't depend on fast vs. smart shutdown; those
alternatives determine whether we kick clients out ungracefully,
not whether we take extra risks with committed data.

I think we should allow the archiver to finish out its tasks fully
in all non-crash cases except one: if we got SIGTERM from init.
In that case there's a very great risk of being SIGKILL'd before
we can finish archiving.  The postmaster cannot easily tell whether
its SIGTERM came from init or not, but we can drive this off the
archiver itself getting SIGTERM'd.  I propose that if the archiver
receives SIGTERM, it should cease to issue any new archive commands,
but just wait till it sees the postmaster exit.  (It can't exit
right away, since there's a race condition: the postmaster might
not have been SIGTERM'd yet, and might therefore spawn a new
archiver, which would have no idea it's unsafe to do anything more.)

There's an obvious failure mode in that, which is that a randomly
issued SIGTERM to the archiver would shut down archiving indefinitely.
We can guard against that with a timeout: the archiver should exit
a minute or two after being SIGTERM'd, even if the postmaster is still
there.  That should certainly be enough delay to avoid the race
condition, and if in fact everything is still hunky-dory the
postmaster will immediately spawn a new archiver.

Hence, attached revised patch ...

            regards, tom lane


Attachment

Re: Revised patch for fixing archiver shutdown behavior

From
Alvaro Herrera
Date:
Tom Lane wrote:

> Hence, attached revised patch ...

Looks good.

Something I'm still wondering is about the archiver/logger combination.
What happens if a postmaster is stopped by the user and the archiver is
still running, and the user starts a new postmaster?  This would launch
a new archiver and logger; and there are now two loggers possibly
writing to the same files, and truncated log lines could occur.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Revised patch for fixing archiver shutdown behavior

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Something I'm still wondering is about the archiver/logger combination.
> What happens if a postmaster is stopped by the user and the archiver is
> still running, and the user starts a new postmaster?  This would launch
> a new archiver and logger; and there are now two loggers possibly
> writing to the same files, and truncated log lines could occur.

I'm not nearly as worried about that as I am about the prospect of
two concurrent archivers :-(

There was discussion of having a lock file for the archiver, but
it's still an open issue.  I'm not sure how to solve the problem
of stale lockfiles --- unlike the postmaster, the archiver can't
assume that it's the only live process with the postgres userid.
For example, after a system crash-and-restart, it's entirely
likely that the PID formerly held by the archiver is now held
by the bgwriter, making the lockfile (if any) look live.

Maybe we should go back to the plan of having the postmaster
wait for the archiver to exit.

            regards, tom lane

Re: Revised patch for fixing archiver shutdown behavior

From
Alvaro Herrera
Date:
Tom Lane wrote:

> There was discussion of having a lock file for the archiver, but
> it's still an open issue.  I'm not sure how to solve the problem
> of stale lockfiles --- unlike the postmaster, the archiver can't
> assume that it's the only live process with the postgres userid.
> For example, after a system crash-and-restart, it's entirely
> likely that the PID formerly held by the archiver is now held
> by the bgwriter, making the lockfile (if any) look live.

Ah, right :-(

> Maybe we should go back to the plan of having the postmaster
> wait for the archiver to exit.

Yeah, that seems the safest to me -- the problem is that it complicates
the shutdown sequence a fair bit, because postmaster must act
differently depending on whether archiving is enabled or not: wait for
bgwriter exit if disabled, or for archiver exit otherwise.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Revised patch for fixing archiver shutdown behavior

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> Maybe we should go back to the plan of having the postmaster
>> wait for the archiver to exit.

> Yeah, that seems the safest to me -- the problem is that it complicates
> the shutdown sequence a fair bit, because postmaster must act
> differently depending on whether archiving is enabled or not: wait for
> bgwriter exit if disabled, or for archiver exit otherwise.

Given the recent changes to make the postmaster act as a state machine,
I don't think this is really a big deal --- it's just one more state.
The bigger part is that the archiver can't wait for postmaster exit.
We'll need a proper shutdown signal for the archiver, but since it's
not using SIGUSR2 that can be commandeered easily.  So it'd be like

    SIGUSR1 -> do an archive cycle
    SIGUSR2 -> do an archive cycle and exit
    no postmaster -> just exit

The rationale for the last is that it's a crash situation, and
furthermore there's a risk of someone starting a new postmaster
and a conflicting archiver.  So we should put back the behavior
my last patch removed of aborting archiving immediately on
postmaster death.

I'll respin my patch this way...

            regards, tom lane

Re: Revised patch for fixing archiver shutdown behavior

From
Simon Riggs
Date:
On Thu, 2008-01-10 at 10:13 -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Something I'm still wondering is about the archiver/logger combination.
> > What happens if a postmaster is stopped by the user and the archiver is
> > still running, and the user starts a new postmaster?  This would launch
> > a new archiver and logger; and there are now two loggers possibly
> > writing to the same files, and truncated log lines could occur.
>
> I'm not nearly as worried about that as I am about the prospect of
> two concurrent archivers :-(

How strange, I was just looking at that particular possibility when your
mail arrived.

The earlier patch looks good, but I see you've reverted the
PostmasterIsAlive() check in pgarch_ArchiverCopyLoop(). That was put in
to prevent multiple archivers. Can we keep that?

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Revised patch for fixing archiver shutdown behavior

From
Simon Riggs
Date:
On Thu, 2008-01-10 at 12:28 -0300, Alvaro Herrera wrote:

> Yeah, that seems the safest to me -- the problem is that it complicates
> the shutdown sequence a fair bit, because postmaster must act
> differently depending on whether archiving is enabled or not: wait for
> bgwriter exit if disabled, or for archiver exit otherwise.

If there's nothing to write, archiver will exit quickly, so in most
cases they will look the same.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Revised patch for fixing archiver shutdown behavior

From
Simon Riggs
Date:
On Thu, 2008-01-10 at 10:49 -0500, Tom Lane wrote:
> So we should put back the behavior
> my last patch removed of aborting archiving immediately on
> postmaster death.

Excellent

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Revised patch for fixing archiver shutdown behavior

From
Tom Lane
Date:
I wrote:
> I'll respin my patch this way...

Third time's the charm?

            regards, tom lane


Attachment

Re: Revised patch for fixing archiver shutdown behavior

From
Simon Riggs
Date:
On Thu, 2008-01-10 at 14:08 -0500, Tom Lane wrote:
> I wrote:
> > I'll respin my patch this way...
>
> Third time's the charm?

If we do a shutdown immediate on the postmaster *after* the bgwriter has
written a shutdown checkpoint, do we have any record that there was a
panic stop? Do we enter recovery in that case? I think the answers are
yes and no, but just checking.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Revised patch for fixing archiver shutdown behavior

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> If we do a shutdown immediate on the postmaster *after* the bgwriter has
> written a shutdown checkpoint, do we have any record that there was a
> panic stop? Do we enter recovery in that case? I think the answers are
> yes and no, but just checking.

Yeah, the postmaster would complain at exit, but the pg_control state
is already SHUTDOWN at that point.  There'd be a log entry showing
abnormal archiver exit if the panic had had any actual effect on it.

            regards, tom lane