Re: [HACKERS] parallel.c oblivion of worker-startup failures - Mailing list pgsql-hackers

From Robert Haas
Subject Re: [HACKERS] parallel.c oblivion of worker-startup failures
Date
Msg-id CA+TgmoaiSS91k7OigH7nUXye5EiVnJiTN-jS93FX4U_Axiu9tA@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] parallel.c oblivion of worker-startup failures  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: [HACKERS] parallel.c oblivion of worker-startup failures  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: [HACKERS] parallel.c oblivion of worker-startup failures  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Wed, Dec 6, 2017 at 1:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Looks good to me.

Committed and back-patched to 9.4 where dynamic background workers
were introduced.

>> Barring objections, I'll commit and
>> back-patch this; then we can deal with the other part of this
>> afterward.
>
> Sure, I will rebase on top of the commit unless you have some comments
> on the remaining part.

I'm not in love with that part of the fix; the other parts of that if
statement are just testing variables, and a function call that takes
and releases an LWLock is a lot more expensive.  Furthermore, that
test will often be hit in practice, because we'll often arrive at this
function before workers have actually finished.  On top of that, we'll
typically arrive here having already communicated with the worker in
some way, such as by receiving tuples, which means that in most cases
we already know that the worker was alive at least at some point, and
therefore the extra test isn't necessary.  We only need that test, if
I understand correctly, to cover the failure-to-launch case, which is
presumably very rare.

All that having been said, I don't quite know how to do it better.  It
would be easy enough to modify this function so that if it ever
received a result other then BGWH_NOT_YET_STARTED for a worker, it
wouldn't call this function again for that worker.  However, that's
only mitigating the problem, not really fixing it.  We'll still end up
frequently inquiring - admittedly only once - about the status of a
worker with which we've previously communicated via the tuple queue.
Maybe we just have to live with that problem, but do you (or does
anyone else) have an idea?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Usage of epoch in txid_current
Next
From: Andrew Dunstan
Date:
Subject: ALTER TABLE ADD COLUMN fast default