Re: SIGQUIT handling, redux - Mailing list pgsql-hackers

From Tom Lane
Subject Re: SIGQUIT handling, redux
Date
Msg-id 251830.1599757015@sss.pgh.pa.us
Whole thread Raw
In response to Re: SIGQUIT handling, redux  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: SIGQUIT handling, redux  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Sep 9, 2020 at 10:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> bgworker_die (SIGTERM)
>> Calls ereport(FATAL).  This is surely not any safer than, say,
>> quickdie().  No, it's worse, because at least that won't try
>> to go out via proc_exit().

> I think bgworker_die() is pretty much a terrible idea.

Yeah.  It'd likely be better to insist that bgworkers handle SIGTERM
the same way backends do, via CHECK_FOR_INTERRUPTS.  Not sure how big
a change that would be.

> I think that the only way this could actually
> be safe is if you have a background worker that never uses ereport()
> itself, so that the ereport() in the signal handler can't be
> interrupting one that's already happening.

That doesn't begin to make it safe, because quite aside from anything
that happens in elog.c, we're then going to go out via proc_exit()
which will invoke who knows what.

>> StandbyDeadLockHandler (from SIGALRM)
>> StandbyTimeoutHandler (ditto)
>> Calls CancelDBBackends, which just for starters tries to acquire
>> an LWLock.  I think the only reason we've gotten away with this
>> for this long is the high probability that by the time either
>> timeout fires, we're going to be blocked on a semaphore.

> Yeah, I'm not sure these are so bad. In fact, in the deadlock case, I
> believe the old coding was designed to make sure we *had to* be
> blocked on a semaphore, but I'm not sure whether that's still true.

Looking at this closer, the only code that could get interrupted
by these timeouts is a call to ProcWaitForSignal, which is

void
ProcWaitForSignal(uint32 wait_event_info)
{
    (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
                     wait_event_info);
    ResetLatch(MyLatch);
    CHECK_FOR_INTERRUPTS();
}

Interrupting the latch operations might be safe enough,
although now I'm casting a side-eye at Munro's recent
changes to share WaitEvent state all over the place.
I wonder whether blocking on an LWLock inside the
signal handler will break an interrupted WaitLatch.

Also, man that CHECK_FOR_INTERRUPTS() looks like trouble.
Could we take that out?

            regards, tom lane



pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: Collation versioning
Next
From: "Jonathan S. Katz"
Date:
Subject: Re: PostgreSQL 13 Release Timeline