Thread: Hot standby fails if any backend crashes

Hot standby fails if any backend crashes

From
Tom Lane
Date:
I'm currently working with Duncan Rance's test case for bug #6425, and
I am observing a very nasty behavior in HEAD: once one of the
hot-standby query backends crashes, the standby postmaster SIGQUIT's
all its children and then just quits itself, with no log message and
apparently no effort to restart.  Surely this is not intended?  The
log shows

TRAP: FailedAssertion("!(((lpp)->lp_flags == 1))", File: "heapam.c", Line: 735)
2012-02-02 18:02:39.985 EST 29363 LOG:  server process (PID 15238) was terminated by signal 6: Aborted
2012-02-02 18:02:39.985 EST 29363 DETAIL:  Failed process was running: SELECT * FROM repro_02_ref;
2012-02-02 18:02:39.985 EST 29363 LOG:  terminating any other active server processes
2012-02-02 18:02:39.985 EST 15214 WARNING:  terminating connection because of crash of another server process
2012-02-02 18:02:39.985 EST 15214 DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2012-02-02 18:02:39.985 EST 15214 HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2012-02-02 18:02:39.985 EST 15213 WARNING:  terminating connection because of crash of another server process
2012-02-02 18:02:39.985 EST 15213 DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2012-02-02 18:02:39.985 EST 15213 HINT:  In a moment you should be able to reconnect to the database and repeat your command.
[ repeat the above for what I assume are all the child processes ]

... and then nothing.  The standby postmaster is no longer running and
there are no log messages from it after the "terminating any other
active server processes" one.  No core dump from it, either.
        regards, tom lane


Re: Hot standby fails if any backend crashes

From
Tom Lane
Date:
I wrote:
> I'm currently working with Duncan Rance's test case for bug #6425, and
> I am observing a very nasty behavior in HEAD: once one of the
> hot-standby query backends crashes, the standby postmaster SIGQUIT's
> all its children and then just quits itself, with no log message and
> apparently no effort to restart.  Surely this is not intended?

I looked through postmaster.c and found that the cause of this is pretty
obvious: if the startup process exits with any non-zero status, we
assume that represents an unrecoverable error condition, and set
RecoveryError which causes the postmaster to exit silently as soon as
its last child is gone.  But we do this even if the reason the startup
process did exit(1) is that we sent it SIGQUIT as a result of a crash of
some other process.  Of course this logic dates from a time where the
startup process could not have any siblings, so when it was written,
such a thing was impossible.

I think saner behavior might only require this change:

           /*
            * Any unexpected exit (including FATAL exit) of the startup
            * process is treated as a crash, except that we don't want to
            * reinitialize.
            */
           if (!EXIT_STATUS_0(exitstatus))
           {
-               RecoveryError = true;
+               if (!FatalError)
+                   RecoveryError = true;
               HandleChildCrash(pid, exitstatus,
                                _("startup process"));
               continue;
           }

plus suitable comment adjustments of course.  Haven't tested this yet
though.
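
For readers following along, the effect of the proposed change can be sketched in isolation. This is only a toy model, not postmaster.c code: the struct PostmasterFlags and the functions model_child_crash, model_startup_exit and postmaster_gives_up are hypothetical names, and the assumption that crash handling sets the FatalError flag is a simplification of what HandleChildCrash() does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two postmaster flags named in the thread. */
typedef struct
{
    bool fatal_error;    /* a child crash is already being handled */
    bool recovery_error; /* startup process failed on its own: do not restart */
} PostmasterFlags;

/* Assumed model of crash handling; the real HandleChildCrash() does far more. */
void model_child_crash(PostmasterFlags *flags)
{
    flags->fatal_error = true;
}

/* Models the patched branch taken when the startup process exits. */
void model_startup_exit(PostmasterFlags *flags, int exitstatus)
{
    assert(flags != NULL);
    if (exitstatus == 0)
        return;                       /* normal exit, nothing to do */
    if (!flags->fatal_error)
        flags->recovery_error = true; /* startup process failed by itself */
    model_child_crash(flags);
}

/* Returns true if the modelled postmaster would exit instead of restarting. */
bool postmaster_gives_up(bool backend_crashed_first)
{
    PostmasterFlags flags = {false, false};

    if (backend_crashed_first)
        model_child_crash(&flags);    /* e.g. a hot-standby query backend aborted */
    model_startup_exit(&flags, 1);    /* startup process exits with status 1 */
    return flags.recovery_error;
}
```

In this model, postmaster_gives_up(true) corresponds to the case in this thread (a backend crashes first, then the startup process exits because it was sent SIGQUIT), and postmaster_gives_up(false) corresponds to the startup process failing on its own.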

It's a bit disturbing that nobody has reported this from the field yet.
Seems to imply that hot standby isn't being used much.
        regards, tom lane


Re: Hot standby fails if any backend crashes

From
Fujii Masao
Date:
On Fri, Feb 3, 2012 at 1:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> I'm currently working with Duncan Rance's test case for bug #6425, and
>> I am observing a very nasty behavior in HEAD: once one of the
>> hot-standby query backends crashes, the standby postmaster SIGQUIT's
>> all its children and then just quits itself, with no log message and
>> apparently no effort to restart.  Surely this is not intended?
>
> I looked through postmaster.c and found that the cause of this is pretty
> obvious: if the startup process exits with any non-zero status, we
> assume that represents an unrecoverable error condition, and set
> RecoveryError which causes the postmaster to exit silently as soon as
> its last child is gone.  But we do this even if the reason the startup
> process did exit(1) is that we sent it SIGQUIT as a result of a crash of
> some other process.  Of course this logic dates from a time where the
> startup process could not have any siblings, so when it was written,
> such a thing was impossible.
>
> I think saner behavior might only require this change:
>
>            /*
>             * Any unexpected exit (including FATAL exit) of the startup
>             * process is treated as a crash, except that we don't want to
>             * reinitialize.
>             */
>            if (!EXIT_STATUS_0(exitstatus))
>            {
> -               RecoveryError = true;
> +               if (!FatalError)
> +                   RecoveryError = true;
>                HandleChildCrash(pid, exitstatus,
>                                 _("startup process"));
>                continue;
>            }
>
> plus suitable comment adjustments of course.  Haven't tested this yet
> though.

Looks good to me.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Hot standby fails if any backend crashes

From
Daniel Farina
Date:
On Thu, Feb 2, 2012 at 8:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's a bit disturbing that nobody has reported this from the field yet.
> Seems to imply that hot standby isn't being used much.

I have seen this, but didn't get to dig in, as I thought it could be a
problem from other things done outside Postgres (it also came up in
#6200, but I didn't mention it).

Consider it retroactively reported.

-- 
fdr


Re: Hot standby fails if any backend crashes

From
Simon Riggs
Date:
On Fri, Feb 3, 2012 at 4:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I think saner behavior might only require this change:
>
>            /*
>             * Any unexpected exit (including FATAL exit) of the startup
>             * process is treated as a crash, except that we don't want to
>             * reinitialize.
>             */
>            if (!EXIT_STATUS_0(exitstatus))
>            {
> -               RecoveryError = true;
> +               if (!FatalError)
> +                   RecoveryError = true;
>                HandleChildCrash(pid, exitstatus,
>                                 _("startup process"));
>                continue;
>            }
>
> plus suitable comment adjustments of course.  Haven't tested this yet
> though.

Looks good, will test.

> It's a bit disturbing that nobody has reported this from the field yet.
> Seems to imply that hot standby isn't being used much.

There are many people I know using it in production for more than a year now.

Either they haven't seen it or they haven't reported it to us.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services