Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown - Mailing list pgsql-bugs
From: David Kohn
Subject: Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown
Msg-id: CAJhMaBiVNRyzxpkNC_6vLyv71PUGp4cWREdOPO4GGg_tYBz9Xw@mail.gmail.com
In response to: Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses: Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown
List: pgsql-bugs
It appears that the BTree problem is actually much less common than the other one, and I'm perfectly happy that a bug fix for it will be coming out in the next few days. Thanks for the work on that. The other problem does appear to be different, so I dove into the code a bit to try to figure it out. I'm unsure of my reasoning, but perhaps you can illuminate it.
On Mon, Feb 5, 2018 at 4:13 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> As Robert mentioned on that other thread, there is a place where the
> leader waits for backends to exit while ignoring interrupts. It'd be
> good to check whether that's happening here, and also to figure out
> what exactly is happening with the workers (and any other backends
> that may be involved in this tangle, for example autovacuum). Can you
> get stack traces for all the relevant processes?
> https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD
> You mentioned that one or more uninterruptible backends were in
> mq_putmessage() in pqmq.c (wait event "MessageQueuePutMessage"). I
> can't immediately see how that can be uninterruptible (for a while I
> wondered if interrupting it was causing it to recurse to try to report
> an error, but I don't think that's it).
I haven't yet had a chance to get a stack trace, but from my reading of the code, the only place a worker gets a wait event of MessageQueuePutMessage is at line 171 of pqmq.c, and the only place a leader gets a wait event of BgWorkerShutdown is at line 1160 of bgworker.c. Given that I am ending up in a state with a leader and one or more workers in those states (and, as far as I can tell, after a statement timeout), it seems to me that the following series of events could cause it (though I haven't quite figured out what code path is taken on a cancel, and whether this is plausible):
1) The leader gets canceled due to the statement timeout, so it effectively does a rollback, calling AtEOXact_Parallel, which calls DestroyParallelContext() without first calling WaitForParallelWorkersToFinish(), because we want to end the query immediately without getting any more results. So we detach from the error queue and, a bit later, from any message queues. We haven't checked for interrupts to process any parallel messages that have come in; we then enter our uninterruptible state while we wait for the workers to exit, and never see the notices about those messages that we would have gotten from the CHECK_FOR_INTERRUPTS() call in WaitForBackgroundWorkerShutdown().
2) The background worker is either trying to send a message on the normal queue and has hit the WaitLatch there, or is trying to send a message on the error queue and hit the WaitLatch because the error message is long. (The latter might be the more plausible explanation: I would have potentially long error messages, and we did just attempt to terminate the background worker. I do not know what happens if the worker attempts to send a message and ends up in the WaitLatch between when the terminate message is sent and when the leader detaches from the error queue; perhaps a SetLatch before detaching from the error queue would help? See the sketch of the send loop after this list.)
3) The worker is uninterruptible because it is waiting on a latch from the parent process before reaching the CHECK_FOR_INTERRUPTS() below the WaitLatch; the leader process is uninterruptible because it is in that spot where it holds interrupts; so they each wait for the other. (Or, in the error-queue case, perhaps the worker is uninterruptible because it is already being forcibly terminated but is waiting inside that call?)
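To make the worker-side blocking point concrete, here is a simplified sketch of the send loop in pqmq.c's mq_putmessage() as I read it. This is paraphrased, not the literal source; the iovec setup, detach handling, and signalling of the leader are omitted:

    for (;;)
    {
        /* Try a non-blocking send of the message onto the shared queue. */
        result = shm_mq_sendv(pq_mq_handle, iov, 2, true);

        if (result != SHM_MQ_WOULD_BLOCK)
            break;

        /*
         * Queue full: sleep until someone sets our latch.  This is the
         * wait reported as MessageQueuePutMessage (pqmq.c:171).  Note
         * that CHECK_FOR_INTERRUPTS() is only reached *after* the latch
         * is set, so if nothing ever sets it the worker never services
         * interrupts here.
         */
        WaitLatch(MyLatch, WL_LATCH_SET, 0, WAIT_EVENT_MQ_PUT_MESSAGE);
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }

If that reading is right, the worker only escapes this loop when the leader detaches the queue (so the send stops returning SHM_MQ_WOULD_BLOCK) or sets its latch, which is why the ordering of the detach and any SetLatch in the leader's cancel path seems relevant.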
I'm not clear on when we do a SetLatch on those message queues during a cancel of parallel workers, and there are a number of other things that could invalidate this analysis, but I think there could be a plausible explanation in there somewhere. Would a timeout on the WaitLatch inside pqmq.c (a relatively large one, say 5 seconds) be too expensive? It seems like it could solve this problem, but I'm not sure whether the overhead would cause a significant slowdown in normal parallel execution, or cause problems if processing a message were delayed for longer than the timeout.
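To be concrete about what I mean, something like the following (hypothetical and untested; the 5-second figure is arbitrary) in place of the WaitLatch call sketched above:

    /*
     * Hypothetical, untested variant: add WL_TIMEOUT so the worker falls
     * through to CHECK_FOR_INTERRUPTS() periodically even if nobody ever
     * sets its latch.  5000 ms is an arbitrary illustrative value.
     */
    (void) WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_TIMEOUT,
                     5000L,
                     WAIT_EVENT_MQ_PUT_MESSAGE);
    ResetLatch(MyLatch);
    CHECK_FOR_INTERRUPTS();    /* now reached on latch set or on timeout */

Since the timeout can only fire while the worker is already blocked on a full queue, I would expect no extra cost in the normal case, but I may be missing something.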
Thanks for the help on this. I hope this is useful, and do let me know if a stack trace or anything else from my end would help.
Best,
David