Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, May 5, 2021 at 3:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Admittedly, it seems unlikely that the difference could exceed
>> MAX_PARALLEL_WORKER_LIMIT = 1024 in a regression test run where
>> the limit on number of parallel workers is only 8. What I think is
>> more likely, given that these counters are unsigned, is that the
>> difference was actually negative. Which could be a bug, or it could
>> be an expectable race condition, or it could just be some flakiness
>> on lorikeet's part (that machine has had a lot of issues lately).
> I think that assertion was added by me, and I think the thought
> process was that the value shouldn't go negative and that if it does
> it's probably a bug which we might want to fix. But since the values
> are unsigned I could hardly check for < 0, so I did it this way
> instead.
> But since there's no memory barrier between the two loads, I guess
> there's no guarantee that they have the expected relationship, even if
> there is a memory barrier on the store side. I wonder if it's worth
> trying to tighten that up so that the assertion is more meaningful, or
> just give up and rip it out. I'm afraid that if we do have (or
> develop) bugs in this area, someone will discover that the effective
> max_parallel_workers value on their system slowly drifts up or down
> from the configured value, and we'll have no clue where things are
> going wrong. The assertion was intended to give us a chance of
> noticing that sort of problem in the buildfarm or on a developer's
> machine before the code gets out into the real world.
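
For reference, the shape of the check under discussion is roughly
this (a simplified stand-alone sketch using the counter names from
above, not the actual bgworker.c code):

#include <assert.h>
#include <stdio.h>

#define MAX_PARALLEL_WORKER_LIMIT 1024

static unsigned parallel_register_count = 0;
static unsigned parallel_terminate_count = 0;

int
main(void)
{
    parallel_register_count = 5;
    parallel_terminate_count = 6;   /* one "extra" termination */

    /*
     * Since the counters are unsigned, "register - terminate < 0"
     * can't be tested directly; instead the difference is required
     * to stay small.  If terminate ever exceeds register, the
     * subtraction wraps around to a value near UINT_MAX.
     */
    printf("difference = %u\n",
           parallel_register_count - parallel_terminate_count);
    /* prints 4294967295 where unsigned is 32 bits, far above 1024 */

    assert(parallel_register_count - parallel_terminate_count <=
           MAX_PARALLEL_WORKER_LIMIT);      /* fires here */
    return 0;
}
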
I follow your concern, but I'm not convinced that this assertion
is a useful aid: first, because the asynchrony involved makes the
edge cases rather squishy; second, because allowing 1024 bogus
increments before complaining likely means that developer test
runs won't last long enough to trigger the assertion; and third,
because if it does fire, it's too far removed from the perpetrator
to be much help in figuring out what went wrong, or even whether
anything *is* wrong.
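
To illustrate the first point: with nothing ordering the two loads
against concurrent updates, the observer can pair counter values
that never coexisted.  Here's a schematic replay of one imaginable
schedule, squeezed into a single process (illustration only, not
real code):

#include <stdio.h>

/* stand-ins for the shared-memory counters */
static unsigned parallel_register_count = 10;
static unsigned parallel_terminate_count = 10;

int
main(void)
{
    /* The observer loads register_count first ... */
    unsigned seen_register = parallel_register_count;  /* sees 10 */

    /*
     * ... then, before it reaches the second load, someone else's
     * worker is registered and reaped, bumping both counters ...
     */
    parallel_register_count++;          /* 11 */
    parallel_terminate_count++;         /* 11 */

    /* ... and only now does the observer load terminate_count. */
    unsigned seen_terminate = parallel_terminate_count;    /* 11 */

    /*
     * The pair (10, 11) never existed at any single instant, yet
     * the unsigned difference computed from it wraps to a huge
     * value.
     */
    printf("observed difference = %u\n",
           seen_register - seen_terminate);
    return 0;
}
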
I've not tried to trace the code, but I'm now a bit suspicious
that there is indeed a design bug here. I gather from the
comments that parallel_register_count is incremented by the
worker processes, which of course implies that a worker that
fails to reattach to shared memory won't do that. But
parallel_terminate_count is incremented by the postmaster.
If the postmaster will do that even in the case of a worker that
failed at startup, then lorikeet's symptoms are neatly explained.
I'd be more comfortable with this code if both counters were
updated by the same process.
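
Schematically, the failure mode I'm imagining looks like this (the
function and variable names are invented for illustration, not the
actual bgworker.c logic):

#include <stdbool.h>
#include <stdio.h>

static unsigned parallel_register_count = 0;
static unsigned parallel_terminate_count = 0;

/* What the worker process would do. */
static void
worker_main(bool attaches_ok)
{
    if (!attaches_ok)
        return;                 /* dies before counting itself */
    parallel_register_count++;
}

/* What the postmaster would do when reaping the worker. */
static void
postmaster_reap(void)
{
    parallel_terminate_count++; /* counted even for a failed worker */
}

int
main(void)
{
    worker_main(false);         /* worker fails to reattach */
    postmaster_reap();

    printf("register=%u terminate=%u difference=%u\n",
           parallel_register_count, parallel_terminate_count,
           parallel_register_count - parallel_terminate_count);
    /* register=0 terminate=1, so the difference wraps around and
     * exceeds MAX_PARALLEL_WORKER_LIMIT, matching lorikeet's report */
    return 0;
}
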
regards, tom lane