Re: [HACKERS] parallel.c oblivion of worker-startup failures - Mailing list pgsql-hackers
From | Robert Haas
---|---
Subject | Re: [HACKERS] parallel.c oblivion of worker-startup failures
Date |
Msg-id | CA+TgmobihJVjjYXreB56CJ9jQUCC4PeuYCw03EwPWBA-yM-ubg@mail.gmail.com
In response to | Re: [HACKERS] parallel.c oblivion of worker-startup failures (Amit Kapila <amit.kapila16@gmail.com>)
Responses | Re: [HACKERS] parallel.c oblivion of worker-startup failures
List | pgsql-hackers
On Tue, Jan 23, 2018 at 9:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Hmm, I think that case will be addressed because tuple queues can
> detect if the leader is not attached. It does in code path
> shm_mq_receive->shm_mq_counterparty_gone. In
> shm_mq_counterparty_gone, it can detect if the worker is gone by using
> GetBackgroundWorkerPid. Moreover, I have manually tested this
> particular case before saying your patch is fine. Do you have some
> other case in mind which I am missing?

Oh, right. I had forgotten that I taught shm_mq_attach() to take an
optional BackgroundWorkerHandle precisely to solve this kind of
problem. I can't decide whether to feel smart because I did that or
dumb because I forgot that I did it.

But I think there's still a problem: given recent commits, that will
cause us to ERROR out if we try to read from a tuple queue for a
worker that failed to start, or for one that started and then died
before attaching to the queue. But there's no guarantee we'll try to
read from that queue in the first place. Normally, that read is
triggered when we get PROCSIG_PARALLEL_MESSAGE, but if the worker
never started up in the first place, it can't send that, and the
postmaster won't.

In Thomas's test case, he's using 4 workers, and if even one of those
manages to start, then it'll probably do so after any fork failures
have already occurred, and the error will be noticed when that process
sends its first message to the leader through the error queue, because
it'll send PROCSIG_PARALLEL_MESSAGE via all the queues. If all of the
workers fail to start, that doesn't help, but the leader still manages
to detect the failure in that case because it reaches
WaitForParallelWorkersToFinish, which we just patched.

But how does that happen, anyway? Shouldn't the leader get stuck
waiting for the tuple queues to which nobody's attached yet? The
reason it doesn't is that ExecParallelCreateReaders() calls
shm_mq_set_handle() for each queue, which causes a queue to be
regarded as detached if its worker fails to start. A detached tuple
queue, in general, is not an error condition: it just means that
worker has no more tuples.

So I think what happens is as follows. If parallel_leader_participation
= on, the leader runs the whole plan, producing however many tuples
the plan should have generated, and then afterwards waits for the
workers to finish, tripping the error. If parallel_leader_participation
= off, the leader doesn't try to run the plan and is perfectly happy
with the fact that every worker has produced no tuples; since, as far
as it knows, the plan is now fully executed, it waits for the workers
to shut down, tripping the error.

I guess that works, but it seems more like blind luck than good
design. Parallel CREATE INDEX fails to be as "lucky" as Gather because
nothing in parallel CREATE INDEX lets it skip waiting for a worker
that never started -- and in general it seems unacceptable to impose a
coding requirement that all future parallel operations must fail to
notice the difference between a worker that completed successfully and
one that never ran at all.

If we made the Gather node wait only for workers that actually reached
the Gather -- either by using a Barrier or by some other technique --
then this would be a lot less fragile. And the same kind of technique
would work for parallel CREATE INDEX.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
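For readers following the thread, here is a minimal leader-side sketch of the handle-aware queue pattern the message describes. It is an illustration under simplified assumptions, not the actual parallel.c code: the function, its arguments, and the setup are hypothetical, while shm_mq_attach(), shm_mq_receive(), and RegisterDynamicBackgroundWorker() are the real APIs.

```c
#include "postgres.h"

#include "postmaster/bgworker.h"
#include "storage/proc.h"
#include "storage/shm_mq.h"

/*
 * Hypothetical leader-side setup.  Assumes "seg" (the DSM segment), "mq"
 * (a queue created in that segment), and "worker" (a filled-in
 * BackgroundWorker) were prepared elsewhere.
 */
static void
read_from_worker(dsm_segment *seg, shm_mq *mq, BackgroundWorker *worker)
{
	BackgroundWorkerHandle *handle;
	shm_mq_handle *mqh;
	shm_mq_result res;
	Size		nbytes;
	void	   *data;

	if (!RegisterDynamicBackgroundWorker(worker, &handle))
		ereport(ERROR,
				(errmsg("could not register background worker")));

	shm_mq_set_receiver(mq, MyProc);

	/*
	 * Passing the handle lets shm_mq_receive() go through
	 * shm_mq_counterparty_gone(), which uses GetBackgroundWorkerPid()
	 * to notice a worker that died -- or never started -- before
	 * attaching, instead of blocking forever.
	 */
	mqh = shm_mq_attach(mq, seg, handle);

	res = shm_mq_receive(mqh, &nbytes, &data, false);
	if (res == SHM_MQ_DETACHED)
		ereport(ERROR,
				(errmsg("lost connection to parallel worker")));
}
```

As the message notes, this is exactly the semantics that cuts both ways: for an error queue, SHM_MQ_DETACHED is worth an ERROR, but for a tuple queue it is indistinguishable from "this worker has no more tuples".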
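And a deliberately rough sketch of the Barrier idea from the last paragraph, using the existing Barrier API from storage/barrier.h. The GatherShared struct, the function names, and the attach/detach placement are all assumptions rather than a worked-out design; in particular, it glosses over the race where a worker has been launched but has not yet attached.

```c
#include "postgres.h"

#include "storage/barrier.h"

/* Hypothetical shared state for a Gather-like node. */
typedef struct GatherShared
{
	Barrier		attach_barrier;	/* who actually reached the Gather? */
} GatherShared;

/* Leader, while initializing shared state: no participants yet. */
static void
gather_shared_init(GatherShared *shared)
{
	BarrierInit(&shared->attach_barrier, 0);
}

/* Worker, immediately on reaching the Gather node. */
static void
gather_worker_arrive(GatherShared *shared)
{
	BarrierAttach(&shared->attach_barrier);
}

/* Worker, once it has no more tuples to send. */
static void
gather_worker_done(GatherShared *shared)
{
	BarrierDetach(&shared->attach_barrier);
}

/*
 * Leader: rather than waiting on nworkers_launched queues, it would
 * wait only for workers that actually attached.  A worker that never
 * started never attaches, so it is never waited for.
 */
static int
gather_arrived_workers(GatherShared *shared)
{
	return BarrierParticipants(&shared->attach_barrier);
}
```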