Re: Automatically sizing the IO worker pool - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: Automatically sizing the IO worker pool
Msg-id: 4vhrgxmx5w4cjr7vgegur3hbkuojt2iz23v4dqfypsgvl5bszi@xrxefkkcxdja
In response to: Re: Automatically sizing the IO worker pool (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: Automatically sizing the IO worker pool
List: pgsql-hackers
Hi,

On 2026-04-08 11:18:51 +1200, Thomas Munro wrote:
> On Wed, Apr 8, 2026 at 7:01 AM Andres Freund <andres@anarazel.de> wrote:
> > The if (worker == -1) is done for every to-be-submitted IO.  If there are no
> > idle workers, we'd redo the pgaio_worker_choose_idle() every time.  ISTM it
> > should just be:
> >
> >                 for (int i = 0; i < num_staged_ios; ++i)
> >                 {
> >                         Assert(!pgaio_worker_needs_synchronous_execution(staged_ios[i]));
> >                         if (!pgaio_worker_submission_queue_insert(staged_ios[i]))
> >                         {
> >                                 /*
> >                                  * If the queue is full, give up and do the rest
> >                                  * synchronously. We're holding an exclusive lock on the
> >                                  * queue so nothing can consume entries.
> >                                  */
> >                                 synchronous_ios = &staged_ios[i];
> >                                 nsync = (num_staged_ios - i);
> >
> >                                 break;
> >                         }
> >                 }
> >
> >                 /* Choose one worker to wake for this batch. */
> >                 if (worker == -1)
> >                         worker = pgaio_worker_choose_idle(-1);
> 
> Well I didn't want to wake a worker if we'd failed to enqueue
> anything.

I think it's worth waking up workers if there are idle ones and the queue is
full?
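
Concretely, something like this (just a sketch reusing the names from the
snippet above; how the chosen worker's latch is reached follows master's
method_worker.c and may not match the patch exactly):

                /* Choose one worker to wake for this batch. */
                if (worker == -1)
                        worker = pgaio_worker_choose_idle(-1);

                /*
                 * Wake an idle worker even if the queue filled up - whatever
                 * we did manage to insert (plus anything already queued)
                 * still needs a consumer.
                 */
                if (worker != -1)
                        SetLatch(io_worker_control->workers[worker].latch);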



> > > No, we only set it if it isn't already set (like a latch), and only
> > > send a pmsignal when we set it (like a latch), and the postmaster only
> > > clears it if it can start a worker (unlike a latch).  That applies in
> > > general, not just when we hit the cap of io_max_workers: while the
> > > postmaster is waiting for launch interval to expire, it will leave the
> flag set, suppressed for 100ms or whatever, and then in the special
> > > case of io_max_workers, for as long as the count remains that high.
> >
> > I'm quite certain that's not how it actually ended up working with the prior
> > version and the benchmark I showed, there indeed were a lot of requests to
> > postmaster.  I think it's because pgaio_worker_cancel_grow() (forgot the old
> > name already) very frequently clears the flag, just for it to be immediately
> > set again.
> >
> >
> > Yep, still happens, does require the max to be smaller than 32 though.
> >
> > While a lot of IO is happening, no new connections being started, and with
> > 1781562 being postmaster's pid:
> >
> > perf stat --no-inherit -p 1781562 -e raw_syscalls:sys_enter -r 0 sleep 1
> >
> >
> >              2,982      raw_syscalls:sys_enter
> >
> >        1.001881364 seconds time elapsed
> >
> >
> > I think it may need a timestamp in the shared state to not allow another
> > postmaster wake until some time has elapsed, or something.
> 
> Hnng.  Studying...

I suspect the primary reason is that pgaio_worker_request_grow() is triggered
even when io_worker_control->nworkers is already >= io_max_workers.
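
An early-out at the top of pgaio_worker_request_grow() (a sketch, not the
patch) would avoid that:

        /* Pool is already at the cap, nothing the postmaster could do for us. */
        if (io_worker_control->nworkers >= io_max_workers)
                return;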


I suspect there's also ping-pong between submission not finding any idle
workers and requesting growth, workers then being idle for a short period,
and the same thing starting over again.

Seems like there should be two fields: one saying "notify postmaster again"
and one saying "postmaster, start a worker".  The former would only be
cleared by the postmaster after the timeout.
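
Roughly like this (a sketch with made-up field names and a made-up pmsignal
reason, just to illustrate the split):

        /* Names below are made up for illustration. */
        typedef struct PgAioWorkerGrowState
        {
                /*
                 * Set when the postmaster has been signalled; the postmaster
                 * clears it again only once the launch timeout has elapsed.
                 */
                pg_atomic_uint32 postmaster_notified;

                /* "postmaster, please start a worker" */
                pg_atomic_uint32 start_worker;
        } PgAioWorkerGrowState;

        static void
        pgaio_worker_request_grow(PgAioWorkerGrowState *gs)
        {
                uint32          expected = 0;

                pg_atomic_write_u32(&gs->start_worker, 1);

                /*
                 * Only signal the postmaster if nobody has done so since it
                 * last cleared the flag, so back-to-back submissions can't
                 * spam it with pmsignals.
                 */
                if (pg_atomic_compare_exchange_u32(&gs->postmaster_notified,
                                                                                   &expected, 1))
                        SendPostmasterSignal(PMSIGNAL_IO_WORKER_CHANGE);       /* name made up */
        }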


> Our goal is simple: process every IO immediately.  We have immediate
> feedback that is simple: there's an IO in the queue and there is no
> idle worker.  The only action we can take is simple: add one more
> worker.  So we don't need to suffer through the maths required to
> figure out the ideal k for our M/G/k queue system (I think that's what
> we have?) or any of the inputs that would require*.  The problem is
> that on its own, the test triggered far too easily because a worker
> that is not marked idle might in fact be just about to pick up that IO

Is that case really concerning? As long as there is some rate limiting on the
start rate, starting another worker when there are no idle workers seems
harmless?  Afaict it's fairly self-limiting.
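
Even something as simple as this on the postmaster side would bound the start
rate (a sketch with made-up names; the 100ms is the launch interval mentioned
above):

        /* Made-up names; presumably the patch ties this into its launch-interval logic. */
        static TimestampTz last_io_worker_start = 0;

        static bool
        io_worker_start_allowed(void)
        {
                TimestampTz now = GetCurrentTimestamp();

                /* Refuse to start another worker within the launch interval. */
                if (last_io_worker_start != 0 &&
                        !TimestampDifferenceExceeds(last_io_worker_start, now, 100))
                        return false;

                last_io_worker_start = now;
                return true;
        }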


> on the one hand, and because there might be rare
> spikes/clustering on the other, so I cooled it off a bit by
> additionally testing if the queue appears to be growing or spiking
> beyond some threshold.  I think it's OK to let the queue grow a bit
> before we are triggered anyway, so the precise value used doesn't seem
> too critical.  Someone might be able to come up with a more defensible
> value, but in the end I just wanted a value that isn't triggered by
> the outliers I see in real systems that are keeping up.  We could tune
> it lower and overshoot more, but this setting seems to work pretty
> well.  It doesn't seem likely that a real system could achieve a
> steady state that is introducing latency but isn't increasing over
> time, and pool size adjustments are bound to lag anyway.

Yea, I don't think the precise logic matters that much, as long as we ramp up
reasonably fast without being crazy; erring towards ramping up a bit faster
seems fine.

Greetings,

Andres Freund


