Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown - Mailing list pgsql-bugs
From:           Thomas Munro
Subject:        Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown
Date:
Msg-id:         CAEepm=1_COFxB1c+Kso=JSH9CZtsJWLOji2d4EJB8E15ory3tw@mail.gmail.com
In response to: Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown (David Kohn <djk447@gmail.com>)
Responses:      Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown
List:           pgsql-bugs
On Fri, Feb 2, 2018 at 11:54 AM, David Kohn <djk447@gmail.com> wrote:
> On Mon, Jan 29, 2018 at 11:08 PM Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> the real question is: why on earth aren't the wait loops responding to
>> SIGINT and SIGTERM?  I wonder if there might be something funky about
>> parallel query + statement timeouts.
>
> Agreed. Seems like a backtrace wouldn't help much. I saw the other thread
> with similar cancellation issues; a couple notes that might help:
> 1) I also have a lateral select inside of a view there. seems doubtful that
> the lateral has anything to do with it, but in case that could be it,
> thought I'd pass that along.

I don't think that's directly relevant -- the cause of the BtreePage wait
event is Parallel Index Scan, which you could prevent by setting
max_parallel_workers_per_gather = 0 or min_parallel_index_scan_size = '5TB'
(assuming your indexes are smaller than that).  The 10.2 release that fixes
the parallel btree scan bug is due in a couple of days.

> 2) Are there any settings that could potentially help with this? for
> instance, this isn't on a replica, so max_standby_archive_delay wouldn't
> more forcefully (potentially) cancel a query, is there anything similar
> that could work here? as you noted we've already set a statement timeout,
> so it isn't responding to that, but it does get cancelled when another
> (hung) process is SIGKILL-ed. When that happens the db goes into recovery
> mode - so is it being sent SIGKILL at that point as well? Or is it some
> other signal that is a little less invasive? Probably not, but thought
> I'd ask.

As Robert mentioned on that other thread, there is a place where the leader
waits for workers to exit while ignoring interrupts.  It'd be good to check
whether that's happening here, and also to figure out what exactly is
happening with the workers (and any other backends that may be involved in
this tangle, for example autovacuum).  Can you get stack traces for all the
relevant processes?

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

You mentioned that one or more uninterruptible backends were in
mq_putmessage() in pqmq.c (wait event "MessageQueuePutMessage").  I can't
immediately see how that can be uninterruptible (for a while I wondered if
interrupting it was causing it to recurse to try to report an error, but I
don't think that's it).

If you kill -QUIT a worker that's waiting there, you'll get this:

  background worker "parallel worker" (PID 46693) exited with exit code 2
  terminating any other active server processes

If you kill -KILL, you'll get:

  background worker "parallel worker" (PID 46721) was terminated by signal 9: Killed
  terminating any other active server processes

Either way, your cluster restarts with a load of "terminating connection
because of crash of another server process" errors.  It seems problematic
that if the leader becomes non-interruptible while the workers are blocked
on a full message queue, there is apparently no way to orchestrate a
graceful stop.

--
Thomas Munro
http://www.enterprisedb.com
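
For anyone hitting the same BtreePage hang before 10.2 ships, a minimal
sketch of the per-session workaround described above (the two GUCs are real
PostgreSQL 10 settings; "mydb" is a placeholder database name):

  psql mydb <<'SQL'
  SET max_parallel_workers_per_gather = 0;   -- disable parallel query for this session
  -- or, more narrowly: SET min_parallel_index_scan_size = '5TB';
  -- ... run the affected query in this same session ...
  SQL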
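
A sketch of collecting the requested stack traces, following the wiki page
linked above (assumes Linux with gdb installed; 12345 stands in for each
suspect PID):

  # list the leader, parallel workers, and any autovacuum workers
  ps auxww | grep postgres
  # attach briefly and dump a backtrace from each suspect process
  gdb -p 12345 -batch -ex bt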
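
Spelled out, the two experiments above amount to this (PIDs taken from the
log lines; any stuck worker PID would do):

  kill -QUIT 46693   # SIGQUIT: worker exits with code 2, cluster crash-restarts
  kill -KILL 46721   # SIGKILL: "terminated by signal 9", same crash-restart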