Re: Automatically sizing the IO worker pool - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Automatically sizing the IO worker pool
Date
Msg-id CA+hUKGJi6L2ftuTiKw2W4AAah6Uivs4JGiq58A7MjePbXpc4BA@mail.gmail.com
In response to Re: Automatically sizing the IO worker pool  (Andres Freund <andres@anarazel.de>)
Responses Re: Automatically sizing the IO worker pool
Re: Automatically sizing the IO worker pool
List pgsql-hackers
On Wed, Apr 8, 2026 at 7:01 AM Andres Freund <andres@anarazel.de> wrote:
> The if (worker == -1) is done for every to be submitted IO.  If there are no
> idle workers, we'd redo the pgaio_worker_choose_idle() every time.  ISTM it
> should just be:
>
>                 for (int i = 0; i < num_staged_ios; ++i)
>                 {
>                         Assert(!pgaio_worker_needs_synchronous_execution(staged_ios[i]));
>                         if (!pgaio_worker_submission_queue_insert(staged_ios[i]))
>                         {
>                                 /*
>                                  * Do the rest synchronously. If the queue is full, give up
>                                  * and do the rest synchronously. We're holding an exclusive
>                                  * lock on the queue so nothing can consume entries.
>                                  */
>                                 synchronous_ios = &staged_ios[i];
>                                 nsync = (num_staged_ios - i);
>
>                                 break;
>                         }
>                 }
>
>                 /* Choose one worker to wake for this batch. */
>                 if (worker == -1)
>                         worker = pgaio_worker_choose_idle(-1);

Well, I didn't want to wake a worker if we'd failed to enqueue
anything.  Ah, I could put the wakeup there and test nsync, or I guess
I could just do it unconditionally.  Considering that.

> > > I think both 'wakeups" and "ios" are a bit too generically named. Based on the
> > > names I have no idea what this heuristic might be.
> >
> > I have struggled to name them.  Does wakeup_count and io_count help?
>
> hist_wakeups, hist_ios?

Thanks, those are good names.

> > No, we only set it if it isn't already set (like a latch), and only
> > send a pmsignal when we set it (like a latch), and the postmaster only
> > clears it if it can start a worker (unlike a latch).  That applies in
> > general, not just when we hit the cap of io_max_workers: while the
> > postmaster is waiting for launch interval to expire, it will leave the
> flag set, suppressed for 100ms or whatever, and in the special
> case of io_max_workers, for as long as the count remains that high.
>
> I'm quite certain that's not how it actually ended up working with the prior
> version and the benchmark I showed, there indeed were a lot of requests to
> postmaster.  I think it's because pgaio_worker_cancel_grow() (forgot the old
> name already) very frequently clears the flag, just for it to be immediately
> set again.
>
>
> Yep, still happens, does require the max to be smaller than 32 though.
>
> While a lot of IO is happening, no new connections being started, and with
> 1781562 being postmaster's pid:
>
> perf stat --no-inherit -p 1781562 -e raw_syscalls:sys_enter -r 0 sleep 1
>
>  Performance counter stats for process id '1781562':
>
>              2,790      raw_syscalls:sys_enter
>
>        1.001872667 seconds time elapsed
>
>              2,814      raw_syscalls:sys_enter
>
>        1.001983049 seconds time elapsed
>
>              3,036      raw_syscalls:sys_enter
>
>        1.001705850 seconds time elapsed
>
>              2,982      raw_syscalls:sys_enter
>
>        1.001881364 seconds time elapsed
>
>
> I think it may need a timestamp in the shared state to not allow another
> postmaster wake until some time has elapsed, or something.

Hnng.  Studying...

> > I should have made it clearer that that's a secondary condition.  The
> > primary condition is: a worker wanted to wake another worker, but
> > found that none were idle.  Unfortunately the whole system is a bit
> > too asynchronous for that to be a reliable cue on its own.  So, I also
> > check if the queue appears to be (1) obviously growing: that's clearly
> > too long and must be introducing latency, or (2) varying "too much".
> > Which I detect in exactly the same way.
> >
> > Imagine a histogram that looks like this:
> >
> > LOG:  depth 00: 7898
> > LOG:  depth 01: 1630
> > LOG:  depth 02: 308
> > LOG:  depth 03: 93
> > LOG:  depth 04: 40
> > LOG:  depth 05: 19
> > LOG:  depth 06: 6
> > LOG:  depth 07: 4
> > LOG:  depth 08: 0
> > LOG:  depth 09: 1
> > LOG:  depth 10: 1
> > LOG:  depth 11: 0
> > LOG:  depth 12: 0
> > LOG:  depth 13: 0
> >
> > If you're failing to find idle workers to wake up AND our magic
> > threshold is hit by something in that long tail, then it'll call for
> > backup.  Of course I'm totally sidestepping a lot of queueing theory
> > maths and just saying "I'd better be able to find an idle worker when
> > I want to" and if not, "there had better not be any outliers that
> > reach this far".
> >
> > I've written a longer explanation in a long comment.  Including a
> > little challenge for someone to do better with real science and maths.
> > I hope it's a bit clearer at least.
>
> Definitely good to have that comment.  Have to ponder it for a bit.

Let me try again.

Our goal is simple: process every IO immediately.  We have immediate
feedback that is simple: there's an IO in the queue and there is no
idle worker.  The only action we can take is simple: add one more
worker.  So we don't need to suffer through the maths required to
figure out the ideal k for our M/G/k queue system (I think that's what
we have?) or any of the inputs that would require*.  The problem is
that on its own, the test triggered far too easily: on the one hand, a
worker that is not marked idle might in fact be just about to pick up
that IO, and on the other, there might be rare spikes/clustering.  So
I cooled it off a bit by additionally testing whether the queue
appears to be growing or spiking beyond some threshold.  I think it's
OK to let the queue grow a bit
before we are triggered anyway, so the precise value used doesn't seem
too critical.  Someone might be able to come up with a more defensible
value, but in the end I just wanted a value that isn't triggered by
the outliers I see in real systems that are keeping up.  We could tune
it lower and overshoot more, but this setting seems to work pretty
well.  It doesn't seem likely that a real system could achieve a
steady state that is introducing latency but isn't increasing over
time, and pool size adjustments are bound to lag anyway.

* It's probably quite hard for call centres to figure out the number
of agents required to make you wait for a certain length of time, but
it's easy to know if you had to wait and you wish they had more!

> I've not again looked through the details, but based on a relatively short
> experiment, the one problematic thing I see is the frequent postmaster
> requests.

Looking into that...


