Thread: Re: connection establishment versus parallel workers

Re: connection establishment versus parallel workers

From: Thomas Munro
On Thu, Dec 12, 2024 at 9:43 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> My team recently received a report about connection establishment times
> increasing substantially from v16 onwards.  Upon further investigation,
> this seems to have something to do with commit 7389aad (which moved a lot
> of postmaster code out of signal handlers) in conjunction with workloads
> that generate many parallel workers.  I've attached a set of reproduction
> steps.  The issue seems to be worst on larger machines (e.g., r8g.48xlarge,
> r5.24xlarge) when max_parallel_workers/max_worker_processes is set very high
> (>= 48).

Interesting.

> Our theory is that commit 7389aad (and follow-ups like commit 239b175) made
> parallel worker processing much more responsive to the point of contending
> with incoming connections, and that before this change, the kernel balanced
> the execution of the signal handlers and ServerLoop() to prevent this.  I
> don't have a concrete proposal yet, but I thought it was still worth
> starting a discussion.  TBH I'm not sure we really need to do anything
> since this arguably comes down to a trade-off between connection and worker
> responsiveness.

One factor is:

         * Check if the latch is set already. If so, leave the loop
         * immediately, avoid blocking again. We don't attempt to report any
         * other events that might also be satisfied.

If we had a way to say "no really, gimme everything you have", I guess
that'd help.  Which reminds me a bit of commit 04a09ee9 (Windows-only
problem, making sure that we handle multiple sockets fairly instead of
reporting only the lowest priority one); I think it'd work the same
way: if you already saw a latch, you'd use a zero timeout for the
system call.
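
To make the idea concrete, here is a minimal sketch of the two policies using
plain poll().  It is not the WaitEventSet code: wait_events(), WaitPolicy and
its two values are hypothetical names, and the latch is modelled as a plain
boolean checked before the system call rather than as a real latch.

```c
#include <poll.h>
#include <stdbool.h>

/* Hypothetical policy switch; neither value exists in PostgreSQL. */
typedef enum
{
    REPORT_LATCH_ONLY,          /* behaviour described by the quoted comment */
    REPORT_EVERYTHING           /* "gimme everything you have" */
} WaitPolicy;

/*
 * Wait on socks[0..nsocks-1].  latch_is_set stands in for the latch flag
 * that is checked before blocking.
 *
 * Returns the number of events reported: the latch counts as one, plus one
 * for each socket whose revents is nonzero after the call.
 */
static int
wait_events(struct pollfd *socks, nfds_t nsocks, bool latch_is_set,
            int timeout_ms, WaitPolicy policy)
{
    int         nevents = 0;
    int         rc;

    for (nfds_t i = 0; i < nsocks; i++)
        socks[i].revents = 0;

    if (latch_is_set)
    {
        nevents++;
        if (policy == REPORT_LATCH_ONLY)
            return nevents;     /* leave immediately, sockets not examined */
        timeout_ms = 0;         /* never block, just collect what is ready */
    }

    /* Errors such as EINTR are simply ignored in this sketch. */
    rc = poll(socks, nsocks, timeout_ms);
    if (rc > 0)
        nevents += rc;

    return nevents;
}
```

With one readable socket and the latch already set, REPORT_LATCH_ONLY
returns 1 and leaves the socket unreported until the next call, while
REPORT_EVERYTHING returns 2 without ever blocking.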



Re: connection establishment versus parallel workers

From: Thomas Munro
On Thu, Dec 12, 2024 at 11:36 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... instead of
> reporting only the lowest priority one)

s/priority/position/



Re: connection establishment versus parallel workers

From: Nathan Bossart
On Fri, Dec 13, 2024 at 02:29:53AM +1300, Thomas Munro wrote:
> Here's an experimental patch to try changing that policy.  It improves
> the connection times on my small computer with your test, but I doubt
> I'm seeing the real issue.  But in theory, assuming a backlog of
> connections and workers to start, I think each server loop should be
> able to accept and fork one client backend, and fork up to 100
> (MAX_BGWORKERS_TO_LAUNCH) background workers.
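
As a rough illustration of that per-loop budget (one client backend, up to
MAX_BGWORKERS_TO_LAUNCH workers), here is a toy model.  loops_to_drain() is a
hypothetical helper, not postmaster code, and it assumes nothing new arrives
while the backlog is being drained.

```c
#define MAX_BGWORKERS_TO_LAUNCH 100     /* value quoted above */

/*
 * Count the server loop iterations needed to drain a backlog when each
 * iteration forks at most one client backend and at most
 * MAX_BGWORKERS_TO_LAUNCH background workers.
 */
static int
loops_to_drain(int pending_connections, int pending_workers)
{
    int         loops = 0;

    while (pending_connections > 0 || pending_workers > 0)
    {
        if (pending_connections > 0)
            pending_connections--;      /* accept and fork one backend */

        if (pending_workers > MAX_BGWORKERS_TO_LAUNCH)
            pending_workers -= MAX_BGWORKERS_TO_LAUNCH;
        else
            pending_workers = 0;        /* remaining workers fit in one pass */

        loops++;
    }

    return loops;
}
```

Under this model a backlog of 10 connections and 250 workers drains in 10
iterations, because the connections are the limiting factor.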

Thanks for the quick response!  I'm taking a look at the patch...

-- 
nathan



Re: connection establishment versus parallel workers

From: Nathan Bossart
Sorry for the delay, and thanks again for digging into this.

On Fri, Dec 13, 2024 at 03:56:00PM +1300, Thomas Munro wrote:
> 0001 patch is unchanged, 0002 patch sketches out a response to the
> observation a couple of paragraphs above.

Both of these patches seem to improve matters quite a bit.  I haven't yet
thought too deeply about it all, but upon a skim, your patches seem
entirely reasonable to me.

However, while this makes the test numbers for >= v16 look more like those
for v15, we're also seeing a big jump from v13 to v14.  This bisects pretty
cleanly to commit d872510.  I haven't figured out _why_ this commit is
impacting this particular test, but I figured I'd at least update the
thread with what we know so far.

-- 
nathan