On Thu, Dec 12, 2024 at 9:43 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> My team recently received a report about connection establishment times
> increasing substantially from v16 onwards. Upon further investigation,
> this seems to have something to do with commit 7389aad (which moved a lot
> of postmaster code out of signal handlers) in conjunction with workloads
> that generate many parallel workers. I've attached a set of reproduction
> steps. The issue seems to be worst on larger machines (e.g., r8g.48xlarge,
> r5.24xlarge) when max_parallel_workers/max_worker_processes is set very high
> (>= 48).

Interesting.

> Our theory is that commit 7389aad (and follow-ups like commit 239b175) made
> parallel worker processing much more responsive to the point of contending
> with incoming connections, and that before this change, the kernel balanced
> the execution of the signal handlers and ServerLoop() to prevent this. I
> don't have a concrete proposal yet, but I thought it was still worth
> starting a discussion. TBH I'm not sure we really need to do anything
> since this arguably comes down to a trade-off between connection and worker
> responsiveness.

One factor is this behavior in WaitEventSetWait():

         * Check if the latch is set already. If so, leave the loop
         * immediately, avoid blocking again. We don't attempt to report any
         * other events that might also be satisfied.

If we had a way to say "no really, gimme everything you have", I guess
that'd help. Which reminds me a bit of commit 04a09ee9 (Windows-only
problem, making sure that we handle multiple sockets fairly instead of
reporting only the lowest priority one); I think it'd work the same
way: if you already saw a latch, you'd use a zero timeout for the
system call.