Re: Query running for very long time (server hanged) with parallel append - Mailing list pgsql-hackers

From David Kohn
Subject Re: Query running for very long time (server hanged) with parallel append
Date
Msg-id CAJhMaBh4uUh--XvaPtiE9OPPWC3E-aXgcnysz38sSGOLRuyT5w@mail.gmail.com
In response to Re: Query running for very long time (server hanged) with parallel append  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Please forgive my inexperience with the codebase, but as the guy who reported this bug (https://www.postgresql.org/message-id/flat/151724453314.1238.409882538067070269%40wrigleys.postgresql.org#151724453314.1238.409882538067070269@wrigleys.postgresql.org), I thought I'd follow your hints, since it's causing some major issues for me. Here are some notes on what I'm seeing and some (possibly silly) thoughts on why:

On Fri, Feb 2, 2018 at 10:16 AM Robert Haas <robertmhaas@gmail.com> wrote:
Is it getting stuck here?

    /*
     * We can't finish transaction commit or abort until all of the workers
     * have exited.  This means, in particular, that we can't respond to
     * interrupts at this stage.
     */
    HOLD_INTERRUPTS();
    WaitForParallelWorkersToExit(pcxt);
    RESUME_INTERRUPTS();

I am seeing unkillable queries with the client backend in the IPC-BgWorkerShutdown wait event. As far as I can tell, that event can only be reported from WaitForBackgroundWorkerShutdown in bgworker.c, which is called from WaitForParallelWorkersToExit in parallel.c, inside DestroyParallelContext. That seems like it should be called when a statement timeout fires (which I think is happening in at least some of my cases), so it would make sense that this is where the problem is.

My background workers are in the IPC-MessageQueuePutMessage wait event, which appears to be possible only from mq_putmessage in pqmq.c. Directly following the WaitLatch there is a CHECK_FOR_INTERRUPTS(), so if a worker is waiting on that latch and never reaches the interrupt check, that would explain things. It also appears that the worker sends a signal to the leader process a few lines before it starts to wait, which is supposed to tell the leader to come read messages off the queue. If the leader gets to WaitForParallelWorkersToExit at the wrong time and ends up waiting on that event, I could see how the two would end up waiting for each other and never finish.

The thing is that DestroyParallelContext seems to detach from the queues, but if the worker hit the wait step before the leader detached from the queue, does it have any way of knowing that?
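
To make that question concrete, here is a toy model of the race I have in mind. It is not PostgreSQL code: ToyQueue and all of the toy_* functions are names I made up for illustration. The wake_sender flag stands for the thing I have not verified, namely whether the leader's detach wakes a worker that is already asleep on a full queue.

```c
#include <stdbool.h>

/* Hypothetical toy model; none of these names are PostgreSQL APIs. */
typedef struct ToyQueue
{
    bool full;              /* no room for the worker's next message */
    bool receiver_detached; /* leader has detached from the queue */
    bool sender_latch_set;  /* wakeup flag for the sleeping worker */
} ToyQueue;

typedef enum { SEND_OK, SEND_DETACHED, SEND_WOULD_BLOCK } ToySendResult;

/* One pass through the worker's send attempt. */
ToySendResult toy_worker_try_send(ToyQueue *q)
{
    if (q->receiver_detached)
        return SEND_DETACHED;
    if (!q->full)
        return SEND_OK;
    return SEND_WOULD_BLOCK;
}

/* Returns true if the sleeping worker wakes up, false if it stays asleep. */
bool toy_worker_wait_latch(ToyQueue *q)
{
    if (!q->sender_latch_set)
        return false;
    q->sender_latch_set = false; /* reset the latch after waking */
    return true;
}

/* Leader detaches; wake_sender models whether detach also wakes the worker. */
void toy_leader_detach(ToyQueue *q, bool wake_sender)
{
    q->receiver_detached = true;
    if (wake_sender)
        q->sender_latch_set = true;
}

/*
 * Replays the ordering in question: the worker blocks on a full queue
 * first, then the leader detaches. Returns true if the worker notices
 * the detach and can exit, false if it would sleep forever while the
 * leader waits for it to exit.
 */
bool toy_worker_exits_after_detach(bool wake_sender)
{
    ToyQueue q = { .full = true, .receiver_detached = false,
                   .sender_latch_set = false };

    if (toy_worker_try_send(&q) != SEND_WOULD_BLOCK)
        return true;                 /* never went to sleep */

    toy_leader_detach(&q, wake_sender);

    if (!toy_worker_wait_latch(&q))
        return false;                /* nobody woke the worker */

    return toy_worker_try_send(&q) == SEND_DETACHED;
}
```

In this model, toy_worker_exits_after_detach(true) lets the worker see the detach and exit, while toy_worker_exits_after_detach(false) leaves it asleep, which is the hang I'm describing. Which of the two the real code corresponds to is what I can't tell.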

Anyway, I'm entirely unsure of my analysis here, but thought I'd offer something to help speed this along. 

Best,
David Kohn

 
