Re: [HACKERS] parallel.c oblivion of worker-startup failures - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: [HACKERS] parallel.c oblivion of worker-startup failures
Date
Msg-id CAA4eK1+iw=4W1hBkKqTqvoOo4h_4srz+w1qE+t9cUQJTt5kBDQ@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] parallel.c oblivion of worker-startup failures  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] parallel.c oblivion of worker-startup failures  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Thu, Dec 21, 2017 at 2:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Dec 19, 2017 at 11:28 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> That is not the main point I am bothered about.  I am concerned that
>> the patch proposed by you can lead to hang if there is a crash (abrupt
>> failure like proc_exit(1)) after attaching to the error queue. This is
>> explained in my email up thread [1].  I think we can fix it by trying
>> to read the queues as mentioned by you, but was not sure if that is
>> the best way to deal with it, so proposed an alternate solution.
>
> OK.  My core point is that I really, really don't want Gather or
> Gather Merge or parallel CREATE INDEX to have to worry about these
> cases; I believe strongly that it is best if these cases cause an
> error at the ParallelContext layer.   I read your email as suggesting
> the opposite, but maybe I misunderstood you.
>
> One thing I thought about is introducing a new state
> BGWH_STARTUP_FAILED, as distinguished from BGWH_STOPPED.  However, it
> seems problematic.  Once the worker is unregistered we wouldn't know
> which one to return, especially if the slot has been reused.
>

What if we don't allow to reuse such slots till the backend/session
that has registered it performs unregister?  Currently, we don't seem
to have an API corresponding to Register*BackgroundWorker() which can
be used to unregister, but maybe we can provide such an API.

> Furthermore, it doesn't help in the case where the worker starts and
> immediately exits without attaching to the DSM.
>

Yeah, but can't we detect that case?  After the worker exits, we can
know its exit status as is passed to CleanupBackgroundWorker, we can
use that to mark the worker state as  BGWH_ERROR_STOPPED (or something
like BGWH_IMMEDIATE_STOPPED).

I think above way sounds invasive, but it seems to me that it can be
used by other users of background workers as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: condition variable cleanup and subtransactions
Next
From: David Rowley
Date:
Subject: Re: [HACKERS] path toward faster partition pruning