Thread: Automatically sizing the IO worker pool

From: Thomas Munro
It's hard to know how to set io_workers=3.  If it's too small,
io_method=worker's small submission queue overflows and it silently
falls back to synchronous IO.  If it's too high, it generates a lot of
pointless wakeups and scheduling overhead, which might be considered
an independent problem or not, but having the right size pool
certainly mitigates it.  Here's a patch to replace that GUC with:

      io_min_workers=1
      io_max_workers=8
      io_worker_idle_timeout=60s
      io_worker_launch_interval=500ms

It grows the pool when a backlog is detected (better ideas for this
logic welcome), and lets idle workers time out.  IO jobs were already
concentrated into the lowest numbered workers, partly because that
seemed to have marginally better latency than anything else tried so
far due to latch collapsing with lucky timing, and partly in
anticipation of this.
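As a rough sketch of the policy those four GUCs describe (the names and structure here are illustrative, not the actual patch code): grow when there's a backlog but rate-limit launches, and let idle workers exit down to the minimum.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; the actual patch's identifiers may differ. */
typedef struct IoWorkerPool
{
    int     nworkers;
    int     min_workers;            /* io_min_workers */
    int     max_workers;            /* io_max_workers */
    int64_t idle_timeout_ms;        /* io_worker_idle_timeout */
    int64_t launch_interval_ms;     /* io_worker_launch_interval */
    int64_t last_launch_ms;
} IoWorkerPool;

/* Launch a new worker when a backlog is detected, but damp bursts
 * by enforcing a minimum interval between launches. */
static bool
pool_should_launch(IoWorkerPool *pool, int queue_depth, int64_t now_ms)
{
    if (pool->nworkers >= pool->max_workers)
        return false;
    if (queue_depth <= pool->nworkers)      /* no backlog */
        return false;
    if (now_ms - pool->last_launch_ms < pool->launch_interval_ms)
        return false;                       /* rate-limit launches */
    return true;
}

/* An idle worker times out and exits, but never below the minimum. */
static bool
worker_should_exit(IoWorkerPool *pool, int64_t idle_ms)
{
    return pool->nworkers > pool->min_workers &&
           idle_ms >= pool->idle_timeout_ms;
}
```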

The patch also reduces bogus wakeups by being a bit more cautious
about fanout.  That could probably be improved a lot more and
needs more research.  It's quite tricky to figure out how to suppress
wakeups without throwing potential concurrency away.

The first couple of patches are independent of this topic, and might
be potential cleanups/fixes for master/v18.  The last is a simple
latency test.

Ideas, testing, flames etc welcome.


Re: Automatically sizing the IO worker pool

From: Jose Luis Tallon
On 12/4/25 18:59, Thomas Munro wrote:
> It's hard to know how to set io_workers=3.

Hmmm.... enable the below behaviour if "io_workers=auto" (default) ?

Sometimes being able to set this kind of parameter manually helps
tremendously with specific workloads... :S

> [snip]
> Here's a patch to replace that GUC with:
>
>        io_min_workers=1
>        io_max_workers=8
>        io_worker_idle_timeout=60s
>        io_worker_launch_interval=500ms

Great as defaults / backwards compat with io_workers=auto. Sounds more 
user-friendly to me, at least....

> [snip]
>
> Ideas, testing, flames etc welcome.

Logic seems sound, if a bit daunting for inexperienced users --- well, 
maybe just a bit more than it is now, but ISTM evolution should try and 
flatten novices' learning curve, right?


Just .02€, though.


Thanks,

-- 
Parkinson's Law: Work expands to fill the time allotted to it.




Re: Automatically sizing the IO worker pool

From: Thomas Munro
On Mon, Apr 14, 2025 at 5:45 AM Jose Luis Tallon
<jltallon@adv-solutions.net> wrote:
> On 12/4/25 18:59, Thomas Munro wrote:
> > It's hard to know how to set io_workers=3.
>
> Hmmm.... enable the below behaviour if "io_workers=auto" (default) ?

Why not just delete io_workers?  If you really want a fixed number,
you can set io_min_workers==io_max_workers.
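That is, if you really do want today's fixed-size behaviour, it's just (values illustrative, in postgresql.conf):

```
io_min_workers = 3
io_max_workers = 3
```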

What should io_max_workers default to?  I guess it could be pretty
large without much danger, but I'm not sure.  If it's a small value,
an overloaded storage system goes through two stages: first it fills
the queue up with a backlog of requests until it overflows because the
configured maximum of workers isn't keeping up, and then new
submissions start falling back to synchronous IO, sort of jumping
ahead of the queued backlog, but also stalling if the real reason is
that the storage itself isn't keeping up.  Whether it'd be better for
the IO worker pool to balloon all the way up to 32 processes (an
internal limit) if required to try to avoid that with default
settings, I'm not entirely sure.  Maybe?  Why not at least try to get
all the concurrency possible, before falling back to synchronous?
Queued but not running IOs seem to be strictly worse than queued but
not even trying to run.  I'd be interested to hear people's thoughts
and experiences actually trying different kinds of workloads on
different kinds of storage.  Whether adding more concurrency actually
helps or just generates a lot of useless new processes before the
backpressure kicks in depends on why it's not keeping up, eg hitting
IOPS, throughput or concurrency limits in the storage.  In later work
I hope we can make higher levels smarter about understanding whether
requesting more concurrency helps or hurts with feedback (that's quite
a hard problem that some of my colleagues have been looking into), but
the simpler question here seems to be: should this fairly low level
system-wide setting ship with a default that includes any preconceived
assumptions about that?

It's superficially like max_parallel_workers, which ships with a
default of 8, and that's basically where I plucked that 8 from in the
current patch for lack of a serious idea to propose yet.  But it's
also more complex than CPU: you know how many cores you have and you
know things about your workload, but even really small "on the metal"
systems probably have a lot more concurrent I/O capacity -- perhaps
depending on the type of operation! (and so far we only have reads) --
than CPU cores.  Especially once you completely abandon the idea that
anyone runs databases on spinning rust in modern times, even on low
end systems, which I think we've more or less agreed to assume these
days with related changes such as the recent *_io_concurrency default
change (1->16).  It's actually pretty hard to drive a laptop up to
needing more than half a dozen or a dozen or so workers with this
patch, especially without debug_io_direct=data, ie with fast
double-buffered I/O, but cloud environments may also be where most
databases run these days, and low end cloud configurations have
arbitrary made-up limits that may be pretty low, so it all depends....
I really don't know, but one idea is that we could leave it as open as
possible, and let users worry about that with higher-level settings
and the query concurrency they choose to generate...
io_method=io_uring is effectively open, so why should io_method=worker
be any different by default?  Just some thoughts.  I'm not sure.



Re: Automatically sizing the IO worker pool

From: Thomas Munro
On Sun, May 25, 2025 at 7:20 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > On Sun, Apr 13, 2025 at 04:59:54AM GMT, Thomas Munro wrote:
> > It's hard to know how to set io_workers=3.  If it's too small,
> > io_method=worker's small submission queue overflows and it silently
> > falls back to synchronous IO.  If it's too high, it generates a lot of
> > pointless wakeups and scheduling overhead, which might be considered
> > an independent problem or not, but having the right size pool
> > certainly mitigates it.  Here's a patch to replace that GUC with:
> >
> >       io_min_workers=1
> >       io_max_workers=8
> >       io_worker_idle_timeout=60s
> >       io_worker_launch_interval=500ms
> >
> > It grows the pool when a backlog is detected (better ideas for this
> > logic welcome), and lets idle workers time out.
>
> I like the idea. In fact, I've been pondering about something like a
> "smart" configuration for quite some time, and convinced that a similar
> approach needs to be applied to many performance-related GUCs.
>
> Idle timeout and launch interval serving as a measure of sensitivity
> makes sense to me, growing the pool when a backlog (queue_depth >
> nworkers, so even a slightest backlog?) is detected seems to be somewhat
> arbitrary. From what I understand the pool growing velocity is constant
> and does not depend on the worker demand (i.e. queue_depth)? It may sound
> fancy, but I've got an impression it should be possible to apply what's
> called a "low-pass filter" in the control theory (sort of a transfer
> function with an exponential decay) to smooth out the demand and adjust
> the worker pool based on that.
>
> As a side note, it might be far fetched, but there are instruments in
> queueing theory to figure out how many workers are needed to guarantee a
> certain low queueing probability, but for that one needs to have an
> average arrival rate (in our case, average number of IO operations
> dispatched to workers) and an average service rate (average number of IO
> operations performed by workers).

Hi Dmitry,

Thanks for looking, and yeah these are definitely the right sort of
questions.  I will be both unsurprised and delighted if someone can
bring some more science to this problem.  I did read up on Erlang's
formula C ("This formula is used to determine the number of agents or
customer service representatives needed to staff a call centre, for a
specified desired probability of queuing" according to Wikipedia) and
a bunch of related textbook stuff.  And yeah I had a bunch of
exponential moving averages of various values using scaled fixed point
arithmetic (just a bunch of shifts and adds) to smooth inputs, in
various attempts.  But ... I'm not even sure if we can say that our
I/O arrivals have a Poisson distribution, since they are not all
independent.  I tried more things too, while I was still unsure what I
should even be optimising for.  My current answer to that is: low
latency with low variance, as seen with io_uring.
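For what it's worth, the shift-and-add fixed point EMA mentioned above can be sketched like this (the constants are illustrative, not from any of those attempts):

```c
#include <stdint.h>

/*
 * Exponential moving average in scaled fixed point: the accumulator
 * keeps FP_SHIFT extra fraction bits, and each update moves it
 * 1/2^EMA_SHIFT of the way toward the new sample -- just shifts and
 * adds, no floating point.
 */
#define FP_SHIFT  8     /* fraction bits in the accumulator */
#define EMA_SHIFT 3     /* smoothing factor alpha = 1/8 */

static void
ema_update(int64_t *ema_fp, int64_t sample)
{
    *ema_fp += ((sample << FP_SHIFT) - *ema_fp) >> EMA_SHIFT;
}

static int64_t
ema_value(int64_t ema_fp)
{
    return ema_fp >> FP_SHIFT;
}
```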

In this version I went back to basics and built something that looks
more like the controls of a classic process/thread pool (think Apache)
or connection pool (think JDBC), with a couple of additions based on
intuition: (1) a launch interval, which acts as a bit of damping
against overshooting on brief bursts that are too far apart, and (2)
the queue length > workers * k as a simple way to determine that
latency is being introduced by not having enough workers.  Perhaps
there is a good way to compute an adaptive value for k with some fancy
theories, but k=1 seems to have *some* basis: that's the lowest number
at which the pool is too small and *certainly* introducing latency, but
any lower constant is harder to defend because we don't know how many
workers are already awake and about to consume tasks.  Something from
queuing theory might provide an adaptive value, but in the end, I
figured we really just want to know if the queue is growing ie in
danger of overflowing (note: the queue is small!  64, and not
currently changeable, more on that later, and the overflow behaviour
is synchronous I/O as back-pressure).  You seem to be suggesting that
k=1 sounds too low, not too high, but there is that separate
time-based defence against overshoot in response to rare bursts.
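Concretely, the trigger amounts to something like this (k and the function name are illustrative):

```c
#include <stdbool.h>

/*
 * Backlog test: with k = 1, "queue deeper than the worker count" is the
 * lowest threshold at which the pool is *certainly* adding latency --
 * even if every worker woke up right now and dequeued one job each,
 * something would still be left waiting.
 */
static bool
backlog_detected(int queue_depth, int nworkers, int k)
{
    return queue_depth > nworkers * k;
}
```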

You could get more certainty about jobs already about to be consumed
by a worker that is about to dequeue, by doing a lot more
bookkeeping: assigning them to workers on submission (separate states,
separate queues, various other ideas I guess).  But everything I tried
like that caused latency or latency variance to go up, because it
missed out on the chance for another worker to pick it up sooner
opportunistically.  This arrangement has the most stable and
predictable pool size and lowest avg latency and stddev(latency) I
have managed to come up with so far.  That said, we have plenty of
time to experiment with better ideas if you want to give it a shot or
propose concrete ideas, given that I missed v18 :-)

About control theory... yeah.  That's an interesting bag of tricks.
FWIW Melanie and (more recently) I have looked into textbook control
algorithms at a higher level of the I/O stack (and Melanie gave a talk
about other applications in eg VACUUM at pgconf.dev).  In
read_stream.c, where I/O demand is created, we've been trying to set
the desired I/O concurrency level and thus lookahead distance with
adaptive feedback.  We've tried a lot of stuff.  I hope we can share
some concept patches some time soon, well, maybe in this cycle.  Some
interesting recent experiments produced graphs that look a lot like
the ones in the book "Feedback Control for Computer Systems" (an easy
software-person book I found for people without an engineering/control
theory background where the problems match our world more closely, cf
typical texts that are about controlling motors and other mechanical
stuff...).  Experimental goals are: find the smallest concurrent
I/O request level (and thus lookahead distance and thus speculative
work done and buffers pinned) that keeps the I/O stall probability
near zero (and keep adapting, since other queries and applications are
sharing system I/O queues), and if that's not even possible, find the
highest concurrent I/O request level that doesn't cause extra latency
due to queuing in lower levels (I/O workers, kernel, ...,  disks).
That second part is quite hard.  In other words, if higher levels own
that problem and bring the adaptivity, then perhaps io_method=worker
can get away with being quite stupid.  Just a thought...



Re: Automatically sizing the IO worker pool

From: Dmitry Dolgov
On Mon, May 26, 2025, 8:01 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> But ... I'm not even sure if we can say that our
> I/O arrivals have a Poisson distribution, since they are not all
> independent.

Yeah, a good point, one has to be careful with assumptions about distribution -- from what I've read many processes in computer systems are better described by a Pareto. But the beauty of queuing theory is that many results are independent of the distribution (not sure about dependencies though).

> In this version I went back to basics and built something that looks
> more like the controls of a classic process/thread pool (think Apache)
> or connection pool (think JDBC), with a couple of additions based on
> intuition: (1) a launch interval, which acts as a bit of damping
> against overshooting on brief bursts that are too far apart, and (2)
> the queue length > workers * k as a simple way to determine that
> latency is being introduced by not having enough workers.  Perhaps
> there is a good way to compute an adaptive value for k with some fancy
> theories, but k=1 seems to have *some* basis: that's the lowest number
> at which the pool is too small and *certainly* introducing latency, but
> any lower constant is harder to defend because we don't know how many
> workers are already awake and about to consume tasks.  Something from
> queuing theory might provide an adaptive value, but in the end, I
> figured we really just want to know if the queue is growing ie in
> danger of overflowing (note: the queue is small!  64, and not
> currently changeable, more on that later, and the overflow behaviour
> is synchronous I/O as back-pressure).  You seem to be suggesting that
> k=1 sounds too low, not too high, but there is that separate
> time-based defence against overshoot in response to rare bursts.

I probably should have started with a statement that I find the current approach reasonable, and I'm only curious whether there is more to get out of it. I haven't benchmarked the patch yet (I plan to get to it when I'm back), and can imagine practical considerations significantly impacting any potential solution.

> About control theory... yeah.  That's an interesting bag of tricks.
> FWIW Melanie and (more recently) I have looked into textbook control
> algorithms at a higher level of the I/O stack (and Melanie gave a talk
> about other applications in eg VACUUM at pgconf.dev).  In
> read_stream.c, where I/O demand is created, we've been trying to set
> the desired I/O concurrency level and thus lookahead distance with
> adaptive feedback.  We've tried a lot of stuff.  I hope we can share
> some concept patches some time soon, well, maybe in this cycle.  Some
> interesting recent experiments produced graphs that look a lot like
> the ones in the book "Feedback Control for Computer Systems" (an easy
> software-person book I found for people without an engineering/control
> theory background where the problems match our world more closely, cf
> typical texts that are about controlling motors and other mechanical
> stuff...).  Experimental goals are: find the smallest concurrent
> I/O request level (and thus lookahead distance and thus speculative
> work done and buffers pinned) that keeps the I/O stall probability
> near zero (and keep adapting, since other queries and applications are
> sharing system I/O queues), and if that's not even possible, find the
> highest concurrent I/O request level that doesn't cause extra latency
> due to queuing in lower levels (I/O workers, kernel, ...,  disks).
> That second part is quite hard.  In other words, if higher levels own
> that problem and bring the adaptivity, then perhaps io_method=worker
> can get away with being quite stupid.  Just a thought...

Looking forward to it. And thanks for the reminder about the talk; I've been meaning to watch it for a long time but somehow haven't managed to yet.