Thread: Re: [ADMIN] recovery is stuck when children are not processing SIGQUIT from previous crash

[moved to -hackers]

On tor, 2009-11-12 at 22:37 +0200, Marko Kreen wrote:
> On 11/12/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Marko Kreen <markokr@gmail.com> writes:
> > > You talked about blocking in quickdie(), but you'd need
> >  > to block in elog().
> >
> >  I'm not really particularly worried about that case.  By that logic,
> >  we could not use quickdie at all, because any part of the system
> >  might be doing something that wouldn't survive being interrupted.
> 
> Not really - we'd need to care only about parts that quickdie()
> (or any other signal handler) wants to use.  Which basically means
> elog() only.
> 
> OK, full elog() is a beast, but why would SIGQUIT handler need full
> elog()?  How about we export minimal log-writing function and make
> that signal-safe - that is, drop message if already active.  This
> will excange potential crash/deadlock with lost msg which seems
> slightly better behaviour.

Yeah, on reflection, calling elog in the SIGQUIT handler is just waiting
for trouble.  The call could block for any number of reasons, because
there is a boatload of functionality that comes with a logging call.  In
the overall scheme of things, you don't really lose much if you just
delete the call altogether, because in the event that it's called the
postmaster will already have logged that it is going to kill all
children.  Or there ought to be some kind of alarm that would abort the
thing if it takes too long.



Peter Eisentraut <peter_e@gmx.net> writes:
> Yeah, on reflection, calling elog in the SIGQUIT handler is just waiting
> for trouble.  The call could block for any number of reasons, because
> there is a boatload of functionality that comes with a logging call.  In
> the overall scheme of things, you don't really lose much if you just
> delete the call altogether, because in the event that it's called the
> postmaster will already have logged that it is going to kill all
> children.  Or there ought to be some kind of alarm that would abort the
> thing if it takes too long.

Well, the point of that call is not to log the event. The point is to
tell the client why it's losing its connection.  Admittedly there are
assorted corner cases where that would fail, but it works well enough
often enough that I don't want to just delete the attempt.
        regards, tom lane


Here is a set of patches to address this issue.

The first one is a small refactoring of the signal setting portability
business.

The second one fixes the SIGQUIT handler inadvertently unblocking
SIGQUIT within itself.

The third one installs an alarm so that if the ereport() call in
quickdie() doesn't finish after 60 seconds, it skips it and finishes up.
The precise logic of this could be debated, but it more or less appears
to get the job done.

Attachment
Peter Eisentraut escribió:
> Here is a set of patches to address this issue.
> 
> The first one is a small refactoring of the signal setting portability
> business.

This one looks like should be applied immediately to get some buildfarm
coverage (and alpha3)



-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Peter Eisentraut <peter_e@gmx.net> writes:
> Here is a set of patches to address this issue.
> The first one is a small refactoring of the signal setting portability
> business.

OK

> The second one fixes the SIGQUIT handler inadvertently unblocking
> SIGQUIT within itself.

OK

> The third one installs an alarm so that if the ereport() call in
> quickdie() doesn't finish after 60 seconds, it skips it and finishes up.
> The precise logic of this could be debated, but it more or less appears
> to get the job done.

I'm not too happy with depending on alarm(), which according to the
pgbench sources is not portable to Windows.  The postmaster does
something equivalent using enable_sig_alarm(); can we use that?

A different line of thought is that this still doesn't fix the problem
because you're depending on a fair amount of libc infrastructure that
might not be too reentrant (and the same objection could probably be
mounted against enable_sig_alarm).  It would be better for the
postmaster to be in charge of enforcing the timeout, and have it issue
SIGKILL against children that don't die fast enough to suit it.
        regards, tom lane