Re: Stefan's bug (was: max_standby_delay considered harmful) - Mailing list pgsql-hackers

From Fujii Masao
Subject Re: Stefan's bug (was: max_standby_delay considered harmful)
Date
Msg-id AANLkTil_TBmQlZUZEM4aEGAYXy5vBAtpnM1M6TPdcSzV@mail.gmail.com
Whole thread Raw
In response to Re: Stefan's bug (was: max_standby_delay considered harmful)  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Stefan's bug (was: max_standby_delay considered harmful)
List pgsql-hackers
On Wed, May 26, 2010 at 9:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, May 24, 2010 at 10:35 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Mon, May 24, 2010 at 10:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> This looks pretty reasonable to me, but I guess I feel like it would
>>> be better to drive the CancelBackup() decision off of whether we've
>>> ever reached PM_RUN rather than consulting XLogCtl.  It just feels
>>> cleaner to me to drive all of the postmaster decisions off of the same
>>> signalling mechanism rather than having a separate one (that only
>>> works because it's used very late in shutdown when we theoretically
>>> don't need a lock) just for this one case.
>>
>> Okay, how about the attached patch? It uses the postmaster-local flag
>> "ReachedEndOfRecovery" (better name?) instead of XLogCtl one.
>
> I've committed part of this patch, with the naming change that Tom
> suggested.

Thanks!

> The parts I haven't committed are:
>
> 1. I don't see why we need to reset ReachedEndOfRecovery starting over
> from PM_NO_CHILDREN.  It seems to me that once we reach PM_RUN, we
> CAN'T go back to needing the backup label file, even if we have a
> subsequent backend crash.  If I'm wrong, please let me know why and
> I'll go put this back (with an appropriate comment).

That reset has nothing to do with cancellation of the backup mode.
I just added it since postmaster might use ReachedNormalRunning for
another purpose in the future. For now, I have no objection not to
add it.

> 2. The changes to avoid launching WALReceiver except during certain
> PM_* states.  It seems fairly sensible, but what is the case where
> adding this logic prevents a problem?

The problem is that shutdown would get stuck when walreceiver is
invoked in PM_WAIT_BACKEND state. Imagine the following scenario:

1. Fast shutdown is requested just before the startup process calls  RequestXLogStreaming() which is the function to
requestpostmaster  to invoke walreceiver. 

2. pmdie() sends SIGTERM to the startup process, but not walreceiver  because it's not been started yet. Then, pmdie()
switchesthe  state into PM_WAIT_BACKENDS. 

3. The startup process goes through RequestXLogStreaming() and requests  postmaster to start walreceiver before
processingSIGTERM sent from  pmdie(). Then the startup process exits, and postmaster invokes  walreceiver in
PM_WAIT_BACKENDSstate. 

4. Once postmaster has reached PM_WAIT_BACKENDS, there is no way to send  SIGTERM to walreceiver. OTOH, postmaster
cannotadvance the state from  PM_WAIT_BACKENDS until walreceiver has gone away. So shutdown gets  stuck. 

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: functional call named notation clashes with SQL feature
Next
From: Fujii Masao
Date:
Subject: Re: Stefan's bug (was: max_standby_delay considered harmful)