Re: connection establishment versus parallel workers - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: connection establishment versus parallel workers
Date
Msg-id CA+hUKGLOcxUa6m7UinPN1gZXFyr92L8btG_pGTHPiWY2YbRw2w@mail.gmail.com
Responses Re: connection establishment versus parallel workers
List pgsql-hackers
On Thu, Dec 12, 2024 at 9:43 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> My team recently received a report about connection establishment times
> increasing substantially from v16 onwards.  Upon further investigation,
> this seems to have something to do with commit 7389aad (which moved a lot
> of postmaster code out of signal handlers) in conjunction with workloads
> that generate many parallel workers.  I've attached a set of reproduction
> steps.  The issue seems to be worst on larger machines (e.g., r8g.48xlarge,
> r5.24xlarge) when max_parallel_workers/max_worker_processes is set very high
> (>= 48).

Interesting.

> Our theory is that commit 7389aad (and follow-ups like commit 239b175) made
> parallel worker processing much more responsive to the point of contending
> with incoming connections, and that before this change, the kernel balanced
> the execution of the signal handlers and ServerLoop() to prevent this.  I
> don't have a concrete proposal yet, but I thought it was still worth
> starting a discussion.  TBH I'm not sure we really need to do anything
> since this arguably comes down to a trade-off between connection and worker
> responsiveness.

One factor is this behavior, described in a comment in WaitEventSetWait():

         * Check if the latch is set already. If so, leave the loop
         * immediately, avoid blocking again. We don't attempt to report any
         * other events that might also be satisfied.

If we had a way to say "no really, gimme everything you have", I guess
that'd help.  Which reminds me a bit of commit 04a09ee9 (Windows-only
problem, making sure that we handle multiple sockets fairly instead of
reporting only the lowest priority one); I think it'd work the same
way: if you already saw a latch, you'd use a zero timeout for the
system call.
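
To make that idea concrete, here is a rough sketch of what I mean. It is
not latch.c code: the latch is just a flag, poll() stands in for whatever
system call the platform uses, and the names wait_for_events(),
ReadyEvents and demo_latch_and_pipe() are made up for illustration. The
only point is that a set latch forces the timeout to zero instead of
causing an early return, so sockets that are already ready get reported
in the same call.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical result type: what a single wait call reports. */
typedef struct ReadyEvents
{
	bool		latch_set;		/* was the (simulated) latch set? */
	int			nready_fds;		/* descriptors poll() reported as ready */
} ReadyEvents;

/*
 * Sketch of "gimme everything you have": if the latch is already set, do
 * not return early.  Use a zero timeout so the system call cannot block,
 * but still collect any other events that are ready right now.
 */
static ReadyEvents
wait_for_events(volatile bool *latch, struct pollfd *fds, nfds_t nfds,
				int timeout_ms)
{
	ReadyEvents result = {false, 0};
	int			rc;

	if (*latch)
	{
		result.latch_set = true;
		timeout_ms = 0;			/* don't sleep, just look */
	}

	rc = poll(fds, nfds, timeout_ms);
	if (rc > 0)
		result.nready_fds = rc;	/* errors and timeouts count as "none" */

	return result;
}

/*
 * Demo: the latch is set, and a pipe (standing in for a listening socket)
 * is optionally made readable.  The caller asks for an infinite timeout.
 */
static ReadyEvents
demo_latch_and_pipe(bool make_readable)
{
	int			pipefd[2];
	volatile bool latch = true;
	struct pollfd pfd;
	ReadyEvents ev;
	int			rc;

	rc = pipe(pipefd);
	assert(rc == 0);
	(void) rc;

	if (make_readable)
	{
		ssize_t		n = write(pipefd[1], "x", 1);

		assert(n == 1);
		(void) n;
	}

	pfd.fd = pipefd[0];
	pfd.events = POLLIN;
	pfd.revents = 0;

	/* -1 would block forever if the latch did not force a zero timeout */
	ev = wait_for_events(&latch, &pfd, 1, -1);

	close(pipefd[0]);
	close(pipefd[1]);
	return ev;
}
```

With the latch set and the pipe readable, one call reports both; with
the latch set and nothing else ready, it returns straight away instead
of sleeping.  A real version would of course also have to drain and
reset the latch and deal with EINTR, which this sketch ignores.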


