From: Thomas Munro
Subject: Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown
Msg-id: CAEepm=1_COFxB1c+Kso=JSH9CZtsJWLOji2d4EJB8E15ory3tw@mail.gmail.com
In response to: Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown (David Kohn <djk447@gmail.com>)
List: pgsql-bugs

On Fri, Feb 2, 2018 at 11:54 AM, David Kohn <djk447@gmail.com> wrote:
> On Mon, Jan 29, 2018 at 11:08 PM Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> the real question is: why on earth aren't the wait loops responding to
>> SIGINT and SIGTERM?  I wonder if there might be something funky about
>> parallel query + statement timeouts.
>
> Agreed. Seems like a backtrace wouldn't help much. I saw the other thread
> with similar cancellation issues; a couple of notes that might help:
> 1) I also have a lateral select inside of a view there. It seems doubtful
> that the lateral has anything to do with it, but in case that could be it,
> I thought I'd pass it along.

I don't think that's directly relevant -- the cause of the BtreePage
wait event is Parallel Index Scan, which you could prevent by setting
max_parallel_workers_per_gather = 0 or min_parallel_index_scan_size =
'5TB' (assuming your indexes are smaller than that).  The 10.2 release
that fixes the parallel btree scan bug is due in a couple of days.

> 2) Are there any settings that could potentially help with this? For
> instance, this isn't on a replica, so max_standby_archive_delay wouldn't
> (potentially) cancel a query more forcefully; is there anything similar
> that could work here? As you noted, we've already set a statement timeout,
> and the query isn't responding to that, but it does get cancelled when
> another (hung) process is SIGKILL-ed. When that happens the db goes into
> recovery mode -- so is it being sent SIGKILL at that point as well? Or is
> it some other signal that is a little less invasive? Probably not, but I
> thought I'd ask.

As Robert mentioned on that other thread, there is a place where the
leader waits for backends to exit while ignoring interrupts.  It'd be
good to check whether that's happening here, and also to figure out what
exactly is happening with the workers (and any other backends that may
be involved in this tangle, for example autovacuum).  Can you get
stack traces for all the relevant processes?

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD
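
For what it's worth, the wait event in the subject line points at the
loop in WaitForBackgroundWorkerShutdown(), which from memory looks
roughly like this in the 10.x sources (a sketch, not a verbatim quote;
the sketch_ function name is mine):

/*
 * Sketch (from memory, 10.x-era sources) of the leader-side wait, cf.
 * WaitForBackgroundWorkerShutdown() in src/backend/postmaster/bgworker.c.
 */
#include "postgres.h"
#include "pgstat.h"
#include "postmaster/bgworker.h"
#include "storage/latch.h"
#include "storage/proc.h"

static BgwHandleStatus
sketch_wait_for_shutdown(BackgroundWorkerHandle *handle)
{
    BgwHandleStatus status;

    for (;;)
    {
        pid_t       pid;
        int         rc;

        status = GetBackgroundWorkerPid(handle, &pid);
        if (status == BGWH_STOPPED)
            break;

        /*
         * There is no CHECK_FOR_INTERRUPTS() in this loop, and (as
         * Robert noted) the cleanup path reaches it with interrupts
         * held, so SIGINT/SIGTERM only set flags that never get
         * serviced while we sit here.
         */
        rc = WaitLatch(&MyProc->procLatch,
                       WL_LATCH_SET | WL_POSTMASTER_DEATH, 0,
                       WAIT_EVENT_BGWORKER_SHUTDOWN);
        if (rc & WL_POSTMASTER_DEATH)
        {
            status = BGWH_POSTMASTER_DIED;
            break;
        }
        ResetLatch(&MyProc->procLatch);
    }
    return status;
}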

You mentioned that one or more uninterruptible backends were in
mq_putmessage() in pqmq.c (wait event "MessageQueuePutMessage").  I
can't immediately see how that can be uninterruptible (for a while I
wondered if interrupting it was causing it to recurse to try to report
an error, but I don't think that's it).  If you kill -QUIT a worker
that's waiting there, you'll get this:

  background worker "parallel worker" (PID 46693) exited with exit code 2
  terminating any other active server processes

If you kill -KILL it instead, you'll get:

  background worker "parallel worker" (PID 46721) was terminated by signal 9: Killed
  terminating any other active server processes

Either way, your cluster restarts with a load of "terminating
connection because of crash of another server process" errors. It
seems problematic that if the leader becomes non-interruptible while
the workers are blocked on a full message queue, there is apparently
no way to orchestrate a graceful stop.
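
For reference, the wait in question has roughly this shape from memory
(a sketch of mq_putmessage() in src/backend/libpq/pqmq.c, 10.x-era
sources, not a verbatim quote; the sketch_ wrapper is mine).  Note that
it checks for interrupts on every wakeup, which is why a stuck sleep
there is so surprising:

/*
 * Sketch (from memory, 10.x-era sources) of the worker-side wait, cf.
 * mq_putmessage() in src/backend/libpq/pqmq.c.
 */
#include "postgres.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/latch.h"
#include "storage/shm_mq.h"

static void
sketch_put_message(shm_mq_handle *pq_mq_handle, shm_mq_iovec *iov,
                   int iovcnt)
{
    for (;;)
    {
        shm_mq_result result;

        /* Try to push the message without blocking. */
        result = shm_mq_sendv(pq_mq_handle, iov, iovcnt, true);
        if (result != SHM_MQ_WOULD_BLOCK)
            break;

        /*
         * Queue full: sleep until the leader drains it.  Signal
         * handlers set MyLatch, so this should wake up and service
         * pending interrupts promptly on SIGINT/SIGTERM.
         */
        (void) WaitLatch(MyLatch, WL_LATCH_SET, 0,
                         WAIT_EVENT_MQ_PUT_MESSAGE);
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }
}

Stack traces should tell us whether the workers are really parked in
that loop or somewhere that holds interrupts.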

-- 
Thomas Munro
http://www.enterprisedb.com

