Re: [HACKERS] parallel.c oblivion of worker-startup failures - Mailing list pgsql-hackers

From Robert Haas
Subject Re: [HACKERS] parallel.c oblivion of worker-startup failures
Date
Msg-id CA+TgmobihJVjjYXreB56CJ9jQUCC4PeuYCw03EwPWBA-yM-ubg@mail.gmail.com
In response to Re: [HACKERS] parallel.c oblivion of worker-startup failures  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: [HACKERS] parallel.c oblivion of worker-startup failures
List pgsql-hackers
On Tue, Jan 23, 2018 at 9:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Hmm, I think that case will be addressed because tuple queues can
> detect if the leader is not attached.  It does so in the code path
> shm_mq_receive->shm_mq_counterparty_gone.  In
> shm_mq_counterparty_gone, it can detect if the worker is gone by using
> GetBackgroundWorkerPid.  Moreover, I have manually tested this
> particular case before saying your patch is fine.  Do you have some
> other case in mind which I am missing?

Oh, right.  I had forgotten that I taught shm_mq_attach() to take an
optional BackgroundWorkerHandle precisely to solve this kind of
problem.  I can't decide whether to feel smart because I did that or
dumb because I forgot that I did it.
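
For anyone trying to follow along, the way those pieces fit together
is roughly this (paraphrasing shm_mq_counterparty_gone() from memory,
so details may differ from the committed code):

    static bool
    shm_mq_counterparty_gone(shm_mq *mq, BackgroundWorkerHandle *handle)
    {
        pid_t   pid;

        /* If the queue has been detached, the counterparty is surely gone. */
        if (mq->mq_detached)
            return true;

        /* If we have a worker handle, ask the bgworker machinery. */
        if (handle != NULL)
        {
            BgwHandleStatus status = GetBackgroundWorkerPid(handle, &pid);

            if (status != BGWH_STARTED && status != BGWH_NOT_YET_STARTED)
            {
                /* Worker died, or the fork failed: treat as detached. */
                mq->mq_detached = true;
                return true;
            }
        }

        /* Not definitively gone; keep waiting. */
        return false;
    }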

But I think there's still a problem: given recent commits, that will
cause us to ERROR out if we try to read from a tuple queue for a
worker that failed to start, or that started and then died before
attaching to the queue.  But there's no guarantee we'll try to read
from that queue in the first place.  Normally, such a read gets
triggered when we receive PROCSIG_PARALLEL_MESSAGE, but a worker that
never started can't send that, and the postmaster won't send it on the
worker's behalf.
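
To make the hazard concrete, the notification path looks roughly like
this (paraphrasing pqmq.c from memory, so the variable names may be
slightly off):

    /*
     * Worker side, on putting a message into its error queue: poke the
     * leader, whose next CHECK_FOR_INTERRUPTS() will then drain the
     * error queues via HandleParallelMessages().  A worker that never
     * forked never runs this, and nobody sends PROCSIG_PARALLEL_MESSAGE
     * on its behalf.
     */
    SendProcSignal(pq_mq_parallel_master_pid,
                   PROCSIG_PARALLEL_MESSAGE,
                   pq_mq_parallel_master_backend_id);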

In Thomas's test case, he's using 4 workers, and if even one of those
manages to start, it'll probably do so after any fork failures have
already occurred, and the error will be noticed when that process
sends its first message to the leader through the error queue: the
resulting PROCSIG_PARALLEL_MESSAGE makes the leader drain all of the
error queues.  If all of the workers fail to start, that doesn't help.
But the failure still gets detected in that case, because we reach
WaitForParallelWorkersToFinish, which we just patched.
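
For reference, the shape of that fix is approximately this (typed from
memory and simplified -- "worker_attached" here just stands in for
however we actually track whether the worker ever attached):

    /*
     * In WaitForParallelWorkersToFinish(): a worker slot whose process
     * has stopped without ever attaching to its error queue must have
     * failed during startup, so report that instead of waiting forever.
     */
    status = GetBackgroundWorkerPid(pcxt->worker[i].bgwhandle, &pid);
    if (status == BGWH_STOPPED && !worker_attached)
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
                 errmsg("parallel worker failed to initialize")));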

But how does that happen, anyway?  Shouldn't it get stuck waiting for
the tuple queues to which nobody's attached yet?  The reason it
doesn't is that ExecParallelCreateReaders() calls shm_mq_set_handle()
for each queue, which causes a tuple queue to be regarded as detached
if its worker fails to start.  A detached tuple queue, in general, is
not an error condition: it just means that worker has no more tuples.
So I think
what happens is as follows.  If parallel_leader_participation = on,
the leader runs the whole plan, producing however many tuples the
plan should have generated, and afterwards waits for the workers
to finish, tripping an error.  If parallel_leader_participation = off,
the leader doesn't try to run the plan and is perfectly happy with the
fact that every worker has produced no tuples; since, as far as it
knows, the plan is now fully executed, it waits for them to shut down,
tripping an error.
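
For reference, ExecParallelCreateReaders() amounts to approximately
this (again from memory):

    /* Give each tuple queue its worker's handle, then build a reader. */
    for (i = 0; i < pei->pcxt->nworkers_launched; ++i)
    {
        /*
         * With the handle installed, a queue whose worker never starts
         * reads as detached -- i.e. "no more tuples" -- rather than
         * blocking forever.
         */
        shm_mq_set_handle(pei->tqueue[i], pei->pcxt->worker[i].bgwhandle);
        pei->reader[i] = CreateTupleQueueReader(pei->tqueue[i]);
    }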

I guess that works, but it seems more like blind luck than good
design.  Parallel CREATE INDEX fails to be as "lucky" as Gather
because there's nothing in parallel CREATE INDEX that lets it skip
waiting for a worker that never starts -- and in general it seems
unacceptable to impose a coding requirement that all future parallel
operations must be unable to notice the difference between a worker
that completed successfully and one that never ran at all.

If we made the Gather node wait only for workers that actually reached
the Gather -- either by using a Barrier or by some other technique --
then this would be a lot less fragile.  And the same kind of technique
would work for parallel CREATE INDEX.
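
Sketching the Barrier version of that idea (the shared field here is
hypothetical, and I'm glossing over the race where a worker attaches
after the leader has already decided it isn't coming):

    /* Leader, at Gather startup: the leader is the only initial participant. */
    BarrierInit(&gather_shared->worker_barrier, 1);

    /* Worker, on reaching the Gather node: announce itself. */
    BarrierAttach(&gather_shared->worker_barrier);
    /* ... run the plan and send tuples ... */
    BarrierDetach(&gather_shared->worker_barrier);

    /*
     * Leader, at shutdown: arrive and wait.  This returns once every
     * worker that actually attached has detached; a worker that never
     * started never attaches, so we don't wait for it.
     */
    BarrierArriveAndWait(&gather_shared->worker_barrier,
                         WAIT_EVENT_PARALLEL_FINISH);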

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

