Re: [HACKERS] parallel.c oblivion of worker-startup failures - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: [HACKERS] parallel.c oblivion of worker-startup failures |
Date | |
Msg-id | CAEepm=3j-LETxYHRDJN2JQ0H1NpE+ydv7uwHmcH8qDvV7d+89w@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] parallel.c oblivion of worker-startup failures (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: [HACKERS] parallel.c oblivion of worker-startup failures
Re: [HACKERS] parallel.c oblivion of worker-startup failures Re: [HACKERS] parallel.c oblivion of worker-startup failures |
List | pgsql-hackers |
On Wed, Jan 24, 2018 at 3:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Jan 24, 2018 at 2:35 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> In the case of Gather, for example, we won't >> WaitForParallelWorkersToFinish() until we've read all the tuples. If >> there's a tuple queue that does not yet have a sender, then we'll just >> wait for it to get one, even if the sender fell victim to a fork >> failure and is never going to show up. >> > > Hmm, I think that case will be addressed because tuple queues can > detect if the leader is not attached. It does in code path > shm_mq_receive->shm_mq_counterparty_gone. In > shm_mq_counterparty_gone, it can detect if the worker is gone by using > GetBackgroundWorkerPid. Moreover, I have manually tested this > particular case before saying your patch is fine. Do you have some > other case in mind which I am missing? Hmm. Yeah. I can't seem to reach a stuck case and was probably just confused and managed to confuse Robert too. If you make fork_process() fail randomly (see attached), I see that there are a couple of easily reachable failure modes (example session at bottom of message): 1. HandleParallelMessages() is reached and raises a "lost connection to parallel worker" error because shm_mq_receive() returns SHM_MQ_DETACHED, I think because shm_mq_counterparty_gone() checked GetBackgroundWorkerPid() just as you said. I guess that's happening because some other process is (coincidentally) sending PROCSIG_PARALLEL_MESSAGE at shutdown, causing us to notice that a process is unexpectedly stopped. 2. WaitForParallelWorkersToFinish() is reached and raises a "parallel worker failed to initialize" error. TupleQueueReaderNext() set done to true, because shm_mq_receive() returned SHM_MQ_DETACHED. Once again, that is because shm_mq_counterparty_gone() returned true. This is the bit Robert and I missed in our off-list discussion. As long as we always get our latch set by the postmaster after a fork failure (ie kill SIGUSR1) and after GetBackgroundWorkerPid() is guaranteed to return BGWH_STOPPED after that, and as long as we only ever use latch/CFI loops to wait, and as long as we try to read from a shm_mq, then I don't see a failure mode that hangs. This doesn't help a parallel.c client that doesn't try to read from a shm_mq though, like parallel CREATE INDEX (using the proposed design without dynamic Barrier). We can't rely on a coincidental PROCSIG_PARALLEL_MESSAGE like in (1), and it's not going to try to read from a queue like in (2). So wouldn't it need a way to perform its own similar check for unexpectedly BGWH_STOPPED processes whenever its latch is set? If there were some way for the postmaster to cause reason PROCSIG_PARALLEL_MESSAGE to be set in the leader process instead of just notification via kill(SIGUSR1) when it fails to fork a parallel worker, we'd get (1) for free in any latch/CFI loop code. But I understand that we can't do that by project edict. === postgres=# create table foox as select generate_series(1, 10000) as a; SELECT 10000 postgres=# alter table foox set (parallel_workers = 4); ALTER TABLE postgres=# set max_parallel_workers_per_gather = 4; SET postgres=# select count(*) from foox; ERROR: lost connection to parallel worker postgres=# select count(*) from foox; ERROR: parallel worker failed to initialize HINT: More details may be available in the server log. postgres=# select count(*) from foox; count ------- 10000 (1 row) postgres=# select count(*) from foox; ERROR: lost connection to parallel worker postgres=# select count(*) from foox; ERROR: lost connection to parallel worker postgres=# select count(*) from foox; ERROR: lost connection to parallel worker postgres=# select count(*) from foox; ERROR: lost connection to parallel worker -- Thomas Munro http://www.enterprisedb.com
Attachment
pgsql-hackers by date: