Thread: Hot standby fails if any backend crashes
I'm currently working with Duncan Rance's test case for bug #6425, and
I am observing a very nasty behavior in HEAD: once one of the
hot-standby query backends crashes, the standby postmaster SIGQUIT's
all its children and then just quits itself, with no log message and
apparently no effort to restart. Surely this is not intended?

The log shows

TRAP: FailedAssertion("!(((lpp)->lp_flags == 1))", File: "heapam.c", Line: 735)
2012-02-02 18:02:39.985 EST 29363 LOG: server process (PID 15238) was terminated by signal 6: Aborted
2012-02-02 18:02:39.985 EST 29363 DETAIL: Failed process was running: SELECT * FROM repro_02_ref;
2012-02-02 18:02:39.985 EST 29363 LOG: terminating any other active server processes
2012-02-02 18:02:39.985 EST 15214 WARNING: terminating connection because of crash of another server process
2012-02-02 18:02:39.985 EST 15214 DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2012-02-02 18:02:39.985 EST 15214 HINT: In a moment you should be able to reconnect to the database and repeat your command.
2012-02-02 18:02:39.985 EST 15213 WARNING: terminating connection because of crash of another server process
2012-02-02 18:02:39.985 EST 15213 DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2012-02-02 18:02:39.985 EST 15213 HINT: In a moment you should be able to reconnect to the database and repeat your command.
[ repeat the above for what I assume are all the child processes ]

... and then nothing. The standby postmaster is no longer running and
there are no log messages from it after the "terminating any other
active server processes" one. No core dump from it, either.

regards, tom lane

I wrote:
> I'm currently working with Duncan Rance's test case for bug #6425, and
> I am observing a very nasty behavior in HEAD: once one of the
> hot-standby query backends crashes, the standby postmaster SIGQUIT's
> all its children and then just quits itself, with no log message and
> apparently no effort to restart. Surely this is not intended?

I looked through postmaster.c and found that the cause of this is pretty
obvious: if the startup process exits with any non-zero status, we
assume that represents an unrecoverable error condition, and set
RecoveryError which causes the postmaster to exit silently as soon as
its last child is gone. But we do this even if the reason the startup
process did exit(1) is that we sent it SIGQUIT as a result of a crash of
some other process. Of course this logic dates from a time where the
startup process could not have any siblings, so when it was written,
such a thing was impossible.

I think saner behavior might only require this change:

        /*
         * Any unexpected exit (including FATAL exit) of the startup
         * process is treated as a crash, except that we don't want to
         * reinitialize.
         */
        if (!EXIT_STATUS_0(exitstatus))
        {
-           RecoveryError = true;
+           if (!FatalError)
+               RecoveryError = true;
            HandleChildCrash(pid, exitstatus,
                             _("startup process"));
            continue;
        }

plus suitable comment adjustments of course. Haven't tested this yet
though.

It's a bit disturbing that nobody has reported this from the field yet.
Seems to imply that hot standby isn't being used much.

regards, tom lane

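To make the effect of the proposed one-line change easier to see, here is a minimal standalone sketch of the flag logic only. It is not code from postmaster.c: the PostmasterFlags struct and the two function names are hypothetical, the EXIT_STATUS_0() test is reduced to a plain comparison with zero, and the HandleChildCrash() call is left out. It assumes FatalError is already set when the startup process dies only because the postmaster sent it SIGQUIT after some other child crashed.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two postmaster state flags discussed above. */
typedef struct
{
    bool FatalError;    /* a crash of some child is already being handled */
    bool RecoveryError; /* startup process failed on its own: do not restart */
} PostmasterFlags;

/* Current behavior: any non-zero startup exit is treated as unrecoverable. */
void
startup_exit_current(PostmasterFlags *pm, int exitstatus)
{
    if (exitstatus != 0)
        pm->RecoveryError = true;
}

/*
 * Proposed behavior: a non-zero startup exit is unrecoverable only if we
 * were not already in crash handling when the startup process went away.
 */
void
startup_exit_proposed(PostmasterFlags *pm, int exitstatus)
{
    if (exitstatus != 0 && !pm->FatalError)
        pm->RecoveryError = true;
}
```

With FatalError already true and an exit status of 1, startup_exit_current() sets RecoveryError while startup_exit_proposed() leaves it clear, which is the case that matters for a hot-standby backend crash. When FatalError is false, both variants behave the same.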
On Fri, Feb 3, 2012 at 1:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> I'm currently working with Duncan Rance's test case for bug #6425, and
>> I am observing a very nasty behavior in HEAD: once one of the
>> hot-standby query backends crashes, the standby postmaster SIGQUIT's
>> all its children and then just quits itself, with no log message and
>> apparently no effort to restart. Surely this is not intended?
>
> I looked through postmaster.c and found that the cause of this is pretty
> obvious: if the startup process exits with any non-zero status, we
> assume that represents an unrecoverable error condition, and set
> RecoveryError which causes the postmaster to exit silently as soon as
> its last child is gone. But we do this even if the reason the startup
> process did exit(1) is that we sent it SIGQUIT as a result of a crash of
> some other process. Of course this logic dates from a time where the
> startup process could not have any siblings, so when it was written,
> such a thing was impossible.
>
> I think saner behavior might only require this change:
>
>         /*
>          * Any unexpected exit (including FATAL exit) of the startup
>          * process is treated as a crash, except that we don't want to
>          * reinitialize.
>          */
>         if (!EXIT_STATUS_0(exitstatus))
>         {
> -           RecoveryError = true;
> +           if (!FatalError)
> +               RecoveryError = true;
>             HandleChildCrash(pid, exitstatus,
>                              _("startup process"));
>             continue;
>         }
>
> plus suitable comment adjustments of course. Haven't tested this yet
> though.

Looks good to me.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

On Thu, Feb 2, 2012 at 8:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's a bit disturbing that nobody has reported this from the field yet.
> Seems to imply that hot standby isn't being used much.

I have seen this, but didn't get to dig in, as I thought it could be a
problem from other things done outside Postgres (it also came up in
#6200, but I didn't mention it).

Consider it retroactively reported.

--
fdr

On Fri, Feb 3, 2012 at 4:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think saner behavior might only require this change:
>
>         /*
>          * Any unexpected exit (including FATAL exit) of the startup
>          * process is treated as a crash, except that we don't want to
>          * reinitialize.
>          */
>         if (!EXIT_STATUS_0(exitstatus))
>         {
> -           RecoveryError = true;
> +           if (!FatalError)
> +               RecoveryError = true;
>             HandleChildCrash(pid, exitstatus,
>                              _("startup process"));
>             continue;
>         }
>
> plus suitable comment adjustments of course. Haven't tested this yet
> though.

Looks good, will test.

> It's a bit disturbing that nobody has reported this from the field yet.
> Seems to imply that hot standby isn't being used much.

There are many people I know using it in production for more than a
year now. Either they haven't seen it or they haven't reported it to us.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services