Thread: Re: connection establishment versus parallel workers

Re: connection establishment versus parallel workers

From: Thomas Munro
On Thu, Dec 12, 2024 at 9:43 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> My team recently received a report about connection establishment times
> increasing substantially from v16 onwards.  Upon further investigation,
> this seems to have something to do with commit 7389aad (which moved a lot
> of postmaster code out of signal handlers) in conjunction with workloads
> that generate many parallel workers.  I've attached a set of reproduction
> steps.  The issue seems to be worst on larger machines (e.g., r8g.48xlarge,
> r5.24xlarge) when max_parallel_workers/max_worker_processes is set very high
> (>= 48).

Interesting.

> Our theory is that commit 7389aad (and follow-ups like commit 239b175) made
> parallel worker processing much more responsive to the point of contending
> with incoming connections, and that before this change, the kernel balanced
> the execution of the signal handlers and ServerLoop() to prevent this.  I
> don't have a concrete proposal yet, but I thought it was still worth
> starting a discussion.  TBH I'm not sure we really need to do anything
> since this arguably comes down to a trade-off between connection and worker
> responsiveness.

One factor is:

         * Check if the latch is set already. If so, leave the loop
         * immediately, avoid blocking again. We don't attempt to report any
         * other events that might also be satisfied.

If we had a way to say "no really, gimme everything you have", I guess
that'd help.  Which reminds me a bit of commit 04a09ee9 (Windows-only
problem, making sure that we handle multiple sockets fairly instead of
reporting only the lowest priority one); I think it'd work the same
way: if you already saw a latch, you'd use a zero timeout for the
system call.
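
To make that concrete, here's a minimal sketch of the idea (plain
poll() rather than the real latch.c machinery, and the function and
parameter names are invented): once the latch has already been seen,
the readiness system call is still made, but with a zero timeout, so
it can only report what's ready right now and never block.

#include <poll.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not latch.c: if the latch was already observed
 * as set, still ask the kernel what else is ready, but don't block.
 */
static int
wait_for_more_events(struct pollfd *fds, int nfds, int timeout_ms,
                     bool latch_already_set)
{
    int         effective_timeout = latch_already_set ? 0 : timeout_ms;

    return poll(fds, (nfds_t) nfds, effective_timeout);
}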



Re: connection establishment versus parallel workers

From: Thomas Munro
On Thu, Dec 12, 2024 at 11:36 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... instead of
> reporting only the lowest priority one)

s/priority/position/



Re: connection establishment versus parallel workers

From: Nathan Bossart
On Fri, Dec 13, 2024 at 02:29:53AM +1300, Thomas Munro wrote:
> Here's an experimental patch to try changing that policy.  It improves
> the connection times on my small computer with your test, but I doubt
> I'm seeing the real issue.  But in theory, assuming a backlog of
> connections and workers to start, I think each server loop should be
> able to accept and fork one client backend, and fork up to 100
> (MAX_BGWORKERS_TO_LAUNCH) background workers.
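
As a toy, self-contained illustration of that per-iteration policy
(the queues are just counters and the numbers are invented, so this is
not the patch itself), each pass handles at most one pending
connection but drains up to MAX_BGWORKERS_TO_LAUNCH pending workers:

#include <stdio.h>

#define MAX_BGWORKERS_TO_LAUNCH 100

int
main(void)
{
    int         pending_connections = 5;
    int         pending_bgworkers = 250;
    int         iteration = 0;

    while (pending_connections > 0 || pending_bgworkers > 0)
    {
        iteration++;

        /* accept and fork at most one client backend per iteration */
        if (pending_connections > 0)
            pending_connections--;

        /* then launch up to MAX_BGWORKERS_TO_LAUNCH background workers */
        for (int i = 0; i < MAX_BGWORKERS_TO_LAUNCH && pending_bgworkers > 0; i++)
            pending_bgworkers--;

        printf("iteration %d: %d connections, %d workers still pending\n",
               iteration, pending_connections, pending_bgworkers);
    }

    return 0;
}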

Thanks for the quick response!  I'm taking a look at the patch...

-- 
nathan



Re: connection establishment versus parallel workers

From: Nathan Bossart
Sorry for the delay, and thanks again for digging into this.

On Fri, Dec 13, 2024 at 03:56:00PM +1300, Thomas Munro wrote:
> 0001 patch is unchanged, 0002 patch sketches out a response to the
> observation a couple of paragraphs above.

Both of these patches seem to improve matters quite a bit.  I haven't yet
thought too deeply about it all, but upon a skim, your patches seem
entirely reasonable to me.

However, while this makes the test numbers for >= v16 look more like those
for v15, we're also seeing a big jump from v13 to v14.  This bisects pretty
cleanly to commit d872510.  I haven't figured out _why_ this commit is
impacting this particular test, but I figured I'd at least update the
thread with what we know so far.

-- 
nathan



Re: connection establishment versus parallel workers

From: Thomas Munro
On Mon, Jan 20, 2025 at 6:33 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's the WIP code I have come up with for that so far.
>
> Remaining opportunities not attempted:
> 1.  When a child exits, we could use a hash table to find it by pid.
> 2.  When looking for a bgworker slot that is not in use, we could do
> something better than linear search.

I haven't had time to work on this again due to other projects, but I
wanted to write down the ideas I thought about for the record.
Obviously we'd want some kind of free list, but the postmaster would
need to push free slots into it when workers exit, and it can't use
locks or any data structures that can cause it to get stuck just
because a backend has corrupted shared memory.  It must always be able
to process shutdown commands and coordinate crash restarts, no matter
how bananas the backends go.  With that in mind:

Idea #1:  A CAS-based linked list, relying on CAS never being emulated
with locks, and probably requiring 16-bit indexes since we can only
expect 32-bit atomic hardware and you might need to change both the
head and tail of a hypothetical list header when going from empty to
one element.  But you have to convince yourself that it's OK to run a
CAS loop in the postmaster, which might in theory be prevented from
completing...
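
One way that could look, simplified to a LIFO free list (everything
here is invented for illustration: the struct names, the bit layout,
and the assumption that pg_atomic_compare_exchange_u32() is not
emulated with spinlocks on the platforms we care about).  The head
word packs a 16-bit slot index with a 16-bit counter, so a single
32-bit CAS updates both:

#include "postgres.h"
#include "port/atomics.h"

#define FREELIST_EMPTY  0xFFFF

typedef struct FreeSlot
{
    uint16      next;           /* next free slot's index, or FREELIST_EMPTY */
} FreeSlot;

typedef struct FreeList
{
    pg_atomic_uint32 head;      /* low 16 bits: slot index; high 16: counter */
    FreeSlot    slots[FLEXIBLE_ARRAY_MEMBER];
} FreeList;

/* Postmaster side: push a slot back onto the free list when a worker exits. */
static void
freelist_push(FreeList *fl, uint16 slotno)
{
    uint32      oldval = pg_atomic_read_u32(&fl->head);

    for (;;)
    {
        uint32      newval;

        /* link the freed slot in front of the current head */
        fl->slots[slotno].next = (uint16) (oldval & 0xFFFF);
        /* bump the 16-bit counter to defend against ABA problems */
        newval = ((oldval & 0xFFFF0000) + 0x10000) | slotno;

        /* the CAS loop in question: in theory it could be starved */
        if (pg_atomic_compare_exchange_u32(&fl->head, &oldval, newval))
            break;
    }
}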

Idea #2:  The free list could be a simple circular buffer of slot
index numbers.  That would be symmetrical with
0001-Remove-BackgroundWorkerStateChange-s-outer-loop.patch's shared
memory "start" queue, and could be coded essentially the same way.
The "start" queue and the "free" queue would then both be simple
arrays with a head and a tail, and in both cases only the consumers
(regular backends) need an lwlock to serialise against each other,
while the producer (the postmaster) can get away with careful memory
barriers and just has to range-check the head/tail indexes to deal
with untrusted shared memory contents.  A rogue backend can jam up the
bgworker subsystem, but that's already true.  It still can't prevent
shutdown or crash restart.
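
A sketch of what that "free" queue might look like (names, sizes and
layout all invented here; the patch's "start" queue is the model being
described, not this code).  The postmaster produces with nothing but a
write barrier, and consumers hold an LWLock (not shown) to serialise
against each other:

#include "postgres.h"
#include "port/atomics.h"

#define FREE_QUEUE_SIZE 128     /* illustrative; really max_worker_processes */

typedef struct FreeQueue
{
    pg_atomic_uint32 head;      /* next position the postmaster writes */
    pg_atomic_uint32 tail;      /* next position a backend reads */
    uint16      items[FREE_QUEUE_SIZE]; /* freed bgworker slot indexes */
} FreeQueue;

/* Postmaster side: push a freed slot; no locks, just a barrier. */
static void
freequeue_push(FreeQueue *q, uint16 slotno)
{
    uint32      head = pg_atomic_read_u32(&q->head);

    q->items[head % FREE_QUEUE_SIZE] = slotno;
    pg_write_barrier();         /* publish the item before moving head */
    pg_atomic_write_u32(&q->head, head + 1);
}

/* Backend side: caller holds the (hypothetical) free-queue LWLock. */
static bool
freequeue_pop(FreeQueue *q, uint16 *slotno)
{
    uint32      head = pg_atomic_read_u32(&q->head);
    uint32      tail = pg_atomic_read_u32(&q->tail);

    /* range-check untrusted head/tail before believing them */
    if (head == tail || head - tail > FREE_QUEUE_SIZE)
        return false;

    pg_read_barrier();          /* pairs with the postmaster's write barrier */
    *slotno = q->items[tail % FREE_QUEUE_SIZE];
    pg_atomic_write_u32(&q->tail, tail + 1);
    return true;
}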

Idea #3:  Suggested by Robert in an off-list chat about all this:
backends could maintain a shared memory free-list using existing dlist
technology protected by an lwlock.  When it's empty *they* (not the
postmaster) would fill it up again using a linear slot search.  It's
simple but not quite as satisfying, because it is still possible to
degrade to high-frequency linear searches that only find a small
number of free slots each time if you're unlucky, i.e. you can
entirely fail to amortise.
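
Roughly, with invented struct and function names (the dlist and LWLock
calls are the existing primitives; the rest is illustration only):

#include "postgres.h"
#include "lib/ilist.h"
#include "storage/lwlock.h"

/* stand-in for the real bgworker slot; the dlist_node is the new part */
typedef struct WorkerSlot
{
    bool        in_use;
    dlist_node  free_node;
} WorkerSlot;

typedef struct WorkerFreeList
{
    LWLock     *lock;
    dlist_head  free_slots;
    int         nslots;
    WorkerSlot *slots;
} WorkerFreeList;

static WorkerSlot *
claim_free_slot(WorkerFreeList *fl)
{
    WorkerSlot *slot = NULL;

    LWLockAcquire(fl->lock, LW_EXCLUSIVE);

    /*
     * List is dry: refill it with a linear scan.  This is the weakness
     * noted above; with only a few free slots you can end up repeating
     * this full scan every time and never amortise its cost.
     */
    if (dlist_is_empty(&fl->free_slots))
    {
        for (int i = 0; i < fl->nslots; i++)
        {
            if (!fl->slots[i].in_use)
                dlist_push_tail(&fl->free_slots, &fl->slots[i].free_node);
        }
    }

    if (!dlist_is_empty(&fl->free_slots))
    {
        slot = dlist_container(WorkerSlot, free_node,
                               dlist_pop_head_node(&fl->free_slots));
        slot->in_use = true;
    }

    LWLockRelease(fl->lock);
    return slot;
}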

Idea #4:  Maintain a bitmap of free slot indexes, which the
postmaster sets with atomic_fetch_or().
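
For example (illustrative names; pg_atomic_fetch_or_u32() is the
flavour of atomic OR we actually have):

#include "postgres.h"
#include "port/atomics.h"

/* Postmaster side: mark worker slot "slotno" free with one atomic OR. */
static void
mark_slot_free(pg_atomic_uint32 *free_bitmap, int slotno)
{
    pg_atomic_fetch_or_u32(&free_bitmap[slotno / 32],
                           UINT32_C(1) << (slotno % 32));
}

Backends would presumably scan the bitmap words for a set bit and
claim it with a CAS, which is still a search, but over 32 slots per
word rather than one slot at a time.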

I lean towards idea #2, but haven't actually tried it.

A similar situation exists for DSM slot management, though that has
different complications: no postmaster interaction, but funky handle
requirements due to portability concerns.  I think the handles could
probably be changed to encode the slot index + generation for O(1)
lookup at "attach" time, and free slots could be stored in a circular
queue for O(1) slot allocation.  I think the use of random numbers
stemmed from SysV shared memory's need to find a free key in a 32-bit
OS-wide namespace (yuck).  I haven't looked at that code in a while
but I don't recall any reason why even those couldn't be hidden inside
the slot itself, instead of being exposed in the handle, forcing
linear searches.  A generation scheme would also be more robust
against weird random number collisions, and could detect handles that
were valid but no longer are in a more obvious way, instead of "I
looked everywhere and I couldn't find it".
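
For instance, the 32-bit dsm_handle could be split into a slot index
and a generation counter along these lines (the bit split and helper
names are invented here, not a proposal for the exact layout):

#include "postgres.h"
#include "storage/dsm.h"

#define DSM_SLOT_BITS   10      /* up to 1024 slots in this sketch */

static inline dsm_handle
make_dsm_handle(uint32 slotno, uint32 generation)
{
    /* generation simply wraps if it overflows the remaining 22 bits */
    return (dsm_handle) ((generation << DSM_SLOT_BITS) | slotno);
}

static inline uint32
dsm_handle_slot(dsm_handle h)
{
    return h & ((1 << DSM_SLOT_BITS) - 1);
}

static inline uint32
dsm_handle_generation(dsm_handle h)
{
    return h >> DSM_SLOT_BITS;
}

dsm_attach() could then go straight to the slot and compare the stored
generation, reporting a stale handle explicitly instead of scanning
every slot and failing to find a match.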