Re: Hot standby, recovery infra - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Hot standby, recovery infra
Date
Msg-id 49A6E1A9.5020901@enterprisedb.com
In response to Re: Hot standby, recovery infra  (Fujii Masao <masao.fujii@gmail.com>)
Responses Re: Hot standby, recovery infra  (Fujii Masao <masao.fujii@gmail.com>)
Re: Hot standby, recovery infra  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
Fujii Masao wrote:
> On Fri, Jan 30, 2009 at 7:47 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> That whole area was something I was leaving until last, since immediate
>> shutdown doesn't work either, even in HEAD. (Fujii-san and I discussed
>> this before Christmas, briefly).
> 
> This problem remains in current HEAD. I mean, immediate shutdown
> may be unable to kill the startup process because system() which
> executes restore_command ignores SIGQUIT while waiting.
> When I tried immediate shutdown during recovery, only the startup
> process survived. This is undesirable behavior, I think.

Yeah, we need to fix that.

> The following code should be added into RestoreArchivedFile()?
> 
> ----
> if (WTERMSIG(rc) == SIGQUIT)
>        exit(2);
> ----

I don't see how that helps, as we already have this in there:

    signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

    ereport(signaled ? FATAL : DEBUG2,
            (errmsg("could not restore file \"%s\" from archive: return code %d",
                    xlogfname, rc)));

which means we already ereport(FATAL) if the restore command dies with 
SIGQUIT.
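
To make that check concrete, here is a minimal standalone sketch of the same classification applied to a status returned by system(). The helper name restore_command_was_signaled is hypothetical and is not a PostgreSQL function; only the expression inside it comes from the snippet above.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <sys/wait.h>

/*
 * Hypothetical helper (not part of PostgreSQL): mirrors the "signaled"
 * expression quoted above.  A status counts as "killed by a signal" if
 * the child itself died from a signal, or if it exited with a code
 * above 125.  Shells conventionally use 126/127 for exec failures and
 * 128+N when a command they ran was killed by signal N.
 */
static bool
restore_command_was_signaled(int rc)
{
    return WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;
}
```

With this, a status from system("exit 1") would be treated as an ordinary failure (reported at DEBUG2 in the code above), while a shell that dies from a signal or exits with 127 would be treated as FATAL.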

I think the real problem here is that pg_standby traps SIGQUIT. The 
startup process doesn't receive the SIGQUIT because it's in system(), 
and pg_standby doesn't propagate it to the startup process either 
because it traps it.
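
The first half of that can be shown in isolation. POSIX specifies that system() ignores SIGINT and SIGQUIT in the calling process while it waits for the command. The sketch below is only a demonstration of that behavior, not PostgreSQL code; the function name and the exit code 7 are made up for the example.

```c
#include <stdlib.h>
#include <sys/wait.h>

/*
 * Demonstration only: the shell child sends SIGQUIT to its parent
 * (this process) and then exits with an arbitrary code.  Because
 * system() ignores SIGQUIT in the caller while waiting, the caller
 * survives and just gets the child's wait status back.
 */
static int
sigquit_during_system(void)
{
    return system("kill -QUIT $PPID; exit 7");
}
```

The caller sees a normal exit status from the command and never notices that a SIGQUIT was sent to it, which matches the startup process surviving an immediate shutdown.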

I think we should simply remove the signal handler for SIGQUIT from 
pg_standby. Or will that lead to a core dump by default? In that case, we 
need pg_standby to exit(128) or similar, so that RestoreArchivedFile 
understands that the command was killed by a signal.

Another approach is to check that the postmaster is still alive, like we
do in walwriter and bgwriter:

    /*
     * Emergency bailout if postmaster has died.  This is to avoid the
     * necessity for manual cleanup of all postmaster children.
     */
    if (!PostmasterIsAlive(true))
        exit(1);

However, I'm afraid there's a race condition with that. If we do that 
right after system(), postmaster might've signaled us but not exited 
yet. We could check that in the main loop, but if we wrongly interpret 
the exit of the recovery command as a "file not found - go ahead and 
start up", the damage might be done by the time we notice that the 
postmaster is gone.
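
To illustrate why the timing matters, here is a simplified stand-in for such a liveness check. It assumes a direct child can detect a dead parent by comparing getppid() with the parent PID recorded at startup; it is not the real PostmasterIsAlive(), and the name parent_is_alive is hypothetical.

```c
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for a "is my parent still there?" check.
 * Once the parent has exited, the child is re-parented and getppid()
 * no longer returns the PID recorded at startup.
 *
 * The race: this only turns false after the parent has actually
 * exited.  A parent that has already sent its signals but is still
 * shutting down looks alive, so calling this immediately after
 * system() returns can miss the shutdown.
 */
static bool
parent_is_alive(pid_t parent_pid_at_start)
{
    return getppid() == parent_pid_at_start;
}
```

A caller would record getppid() once at startup and pass that value in on each check.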

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

