Re: Query running for very long time (server hanged) with parallel append - Mailing list pgsql-hackers

From	David Kohn
Subject	Re: Query running for very long time (server hanged) with parallel append
Date	February 3, 2018 05:17:49
Msg-id	CAJhMaBh4uUh--XvaPtiE9OPPWC3E-aXgcnysz38sSGOLRuyT5w@mail.gmail.com
In response to	Re: Query running for very long time (server hanged) with parallel append (Robert Haas <robertmhaas@gmail.com>)
List	pgsql-hackers

Please forgive my inexperience with the codebase, but as the guy who reported this bug ( https://www.postgresql.org/message-id/flat/151724453314.1238.409882538067070269%40wrigleys.postgresql.org#151724453314.1238.409882538067070269@wrigleys.postgresql.org ), I thought I'd follow your hints, as it's causing some major issues for me. Here are some notes on what I am seeing and some (possibly silly) thoughts on why:

On Fri, Feb 2, 2018 at 10:16 AM Robert Haas <robertmhaas@gmail.com> wrote:

Is it getting stuck here?

/*
* We can't finish transaction commit or abort until all of the workers
* have exited. This means, in particular, that we can't respond to
* interrupts at this stage.
*/
HOLD_INTERRUPTS();
WaitForParallelWorkersToExit(pcxt);
RESUME_INTERRUPTS();

I am seeing unkillable queries with the client backend in the IPC-BgWorkerShutdown wait event. As far as I can tell, that can only happen in bgworker.c at WaitForBackgroundWorkerShutdown, which is called from parallel.c at WaitForParallelWorkersToExit inside DestroyParallelContext. That seems like it should be called when there is a statement timeout (which I think is happening in at least some of my cases), so it would make sense that this is where the problem is.

My background workers are in the IPC-MessageQueuePutMessage wait event, which appears to be possible only from pqmq.c at mq_putmessage. Directly following the WaitLatch there is a CHECK_FOR_INTERRUPTS(), so if the worker is waiting on that latch and never reaches the interrupt check, that would explain things. It also appears that the worker sends a signal to the leader process a few lines before it starts to wait, which is supposed to tell the leader to come read messages off the queue. If the leader gets to WaitForParallelWorkersToExit at the wrong time and ends up waiting on that event, I could see how they would both end up waiting for the other and never finishing.

The thing is that DestroyParallelContext seems to be detaching from the queues, but if the worker hit the wait step before the leader detached from the queue, does it have any way of knowing that?
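
To make that question concrete, here is a toy model in plain C. It is not PostgreSQL code: ToyQueue, toy_init, toy_try_send and toy_detach are names I made up for illustration, and it only assumes that a sender can see some "receiver detached" flag each time it wakes up. Whether the real queue code gives the worker that chance is exactly what I am unsure about.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; none of these names exist in PostgreSQL. */
typedef enum
{
    TOY_SEND_OK,          /* message was queued */
    TOY_SEND_WOULD_BLOCK, /* queue full: the worker would sleep on its latch */
    TOY_SEND_DETACHED     /* receiver is gone: the worker can give up and exit */
} ToySendResult;

typedef struct
{
    int  capacity;          /* number of slots in the queue */
    int  used;              /* slots currently filled */
    bool receiver_detached; /* set once by the leader, never cleared */
} ToyQueue;

static void
toy_init(ToyQueue *q, int capacity)
{
    assert(capacity > 0);
    q->capacity = capacity;
    q->used = 0;
    q->receiver_detached = false;
}

/* One pass of the worker's send loop: what it decides each time it wakes. */
static ToySendResult
toy_try_send(ToyQueue *q)
{
    if (q->receiver_detached)
        return TOY_SEND_DETACHED;
    if (q->used >= q->capacity)
        return TOY_SEND_WOULD_BLOCK;
    q->used++;
    return TOY_SEND_OK;
}

/*
 * Leader side: stop reading for good. This model only flips the flag;
 * a real implementation would also have to wake the sleeping worker,
 * otherwise the worker never re-checks the flag.
 */
static void
toy_detach(ToyQueue *q)
{
    q->receiver_detached = true;
}
```

In this model, a worker facing a full queue gets TOY_SEND_WOULD_BLOCK on every attempt for as long as the leader neither reads nor detaches, which is the "both waiting for the other" state I described. Once toy_detach has run, the next attempt returns TOY_SEND_DETACHED, but only if something wakes the worker so that it tries again.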

Anyway, I'm entirely unsure of my analysis here, but thought I'd offer something to help speed this along.

Best,
David Kohn
