Thread: Automatically sizing the IO worker pool

Automatically sizing the IO worker pool

From: Thomas Munro
It's hard to know how to set io_workers=3.  If it's too small,
io_method=worker's small submission queue overflows and it silently
falls back to synchronous IO.  If it's too high, it generates a lot of
pointless wakeups and scheduling overhead, which might or might not
be considered an independent problem, but having a right-sized pool
certainly mitigates it.  Here's a patch to replace that GUC with:

      io_min_workers=1
      io_max_workers=8
      io_worker_idle_timeout=60s
      io_worker_launch_interval=500ms

It grows the pool when a backlog is detected (better ideas for this
logic welcome), and lets idle workers time out.  IO jobs were already
concentrated in the lowest-numbered workers, partly because that
showed marginally better latency than anything else tried so far
(thanks to latch collapsing with lucky timing), and partly in
anticipation of this change.
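
In rough pseudo-C, the policy amounts to something like this (a
sketch of the intent only; all names here are invented for
illustration, not taken from the patch):

    /* Sketch only: io_worker_count, last_launch_time, last_work_time,
     * queue_backlog_detected() and the launch/reap helpers are
     * invented names. */
    static void
    maybe_resize_io_worker_pool(void)
    {
        TimestampTz now = GetCurrentTimestamp();

        /* Grow one worker at a time when a backlog is detected, but
         * no faster than io_worker_launch_interval, so a short burst
         * can't balloon the pool. */
        if (io_worker_count < io_max_workers &&
            queue_backlog_detected() &&
            TimestampDifferenceExceeds(last_launch_time, now,
                                       io_worker_launch_interval))
        {
            launch_io_worker();
            last_launch_time = now;
        }

        /* Shrink from the top: because jobs are concentrated in the
         * lowest-numbered workers, the highest-numbered worker goes
         * idle first, and exits once it has seen no work for
         * io_worker_idle_timeout, never dropping below
         * io_min_workers. */
        if (io_worker_count > io_min_workers &&
            TimestampDifferenceExceeds(last_work_time[io_worker_count - 1],
                                       now, io_worker_idle_timeout))
            reap_highest_numbered_worker();
    }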

The patch also reduces bogus wakeups somewhat by being more cautious
about fanout.  That could probably be improved a lot more and
needs more research.  It's quite tricky to figure out how to suppress
wakeups without throwing potential concurrency away.
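
For example, one conservative rule (invented names again, just to
illustrate the trade-off):

    /* Wake at most one sleeping worker per submission, and only when
     * the queue is deeper than the number of workers already awake:
     * waking fewer leaves concurrency on the table, waking more just
     * generates bogus wakeups. */
    if (pending_io_count > awake_worker_count && sleeping_worker != NULL)
        SetLatch(&sleeping_worker->procLatch);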

The first couple of patches are independent of this topic, and might
be potential cleanups/fixes for master/v18.  The last is a simple
latency test.

Ideas, testing, flames etc welcome.


Re: Automatically sizing the IO worker pool

From: Jose Luis Tallon
On 12/4/25 18:59, Thomas Munro wrote:
> It's hard to know how to set io_workers=3.

Hmmm.... enable the below behaviour if "io_workers=auto" (default)?

Sometimes being able to set this kind of parameter manually helps
tremendously with specific workloads... :S

> [snip]
> Here's a patch to replace that GUC with:
>
>        io_min_workers=1
>        io_max_workers=8
>        io_worker_idle_timeout=60s
>        io_worker_launch_interval=500ms

Great as defaults / backwards compat with io_workers=auto. Sounds more 
user-friendly to me, at least....

> [snip]
>
> Ideas, testing, flames etc welcome.

Logic seems sound, if a bit daunting for inexperienced users --- well, 
maybe just a bit more than it is now, but ISTM evolution should try and 
flatten novices' learning curve, right?


Just .02€, though.


Thanks,

-- 
Parkinson's Law: Work expands to fill the time allotted to it.




Re: Automatically sizing the IO worker pool

From: Thomas Munro
On Mon, Apr 14, 2025 at 5:45 AM Jose Luis Tallon
<jltallon@adv-solutions.net> wrote:
> On 12/4/25 18:59, Thomas Munro wrote:
> > It's hard to know how to set io_workers=3.
>
> Hmmm.... enable the below behaviour if "io_workers=auto" (default) ?

Why not just delete io_workers?  If you really want a fixed number,
you can set io_min_workers==io_max_workers.
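
For example, today's io_workers=3 would become:

      io_min_workers=3
      io_max_workers=3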

What should io_max_workers default to?  I guess it could be pretty
large without much danger, but I'm not sure.  If it's a small value,
an overloaded storage system goes through two stages: first it fills
the queue up with a backlog of requests until it overflows because the
configured maximum number of workers isn't keeping up, and then new
submissions start falling back to synchronous IO, sort of jumping
ahead of the queued backlog, but also stalling if the real reason is
that the storage itself isn't keeping up.  Whether it'd be better for
the IO worker pool to balloon all the way up to 32 processes (an
internal limit) if required to try to avoid that with default
settings, I'm not entirely sure.  Maybe?  Why not at least try to get
all the concurrency possible, before falling back to synchronous?
IOs that sit queued but not running seem strictly worse than IOs that
are at least trying to run.  I'd be interested to hear people's thoughts
and experiences actually trying different kinds of workloads on
different kinds of storage.  Whether adding more concurrency actually
helps or just generates a lot of useless new processes before the
backpressure kicks in depends on why it's not keeping up, eg hitting
IOPS, throughput or concurrency limits in the storage.  In later work
I hope we can make higher levels smarter about understanding whether
requesting more concurrency helps or hurts with feedback (that's quite
a hard problem that some of my colleagues have been looking into), but
the simpler question here seems to be: should this fairly low level
system-wide setting ship with a default that includes any preconceived
assumptions about that?
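
In outline, the submission path described above behaves something
like this (invented names, not the actual code):

    if (!io_worker_queue_try_enqueue(ioh))
    {
        /* The queue overflowed because even the configured maximum
         * number of workers isn't keeping up: the submitting backend
         * performs the IO itself, synchronously, jumping ahead of the
         * queued backlog, and stalling if the storage itself is the
         * real bottleneck. */
        perform_io_synchronously(ioh);
    }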

It's superficially like max_parallel_workers, which ships with a
default of 8, and that's basically where I plucked that 8 from in the
current patch for lack of a serious idea to propose yet.  But it's
also more complex than CPU: you know how many cores you have and you
know things about your workload, but even really small "on the metal"
systems probably have a lot more concurrent I/O capacity -- perhaps
depending on the type of operation! (and so far we only have reads) --
than CPU cores.  Especially once you completely abandon the idea that
anyone runs databases on spinning rust in modern times, even on low
end systems, which I think we've more or less agreed to assume these
days with related changes such as the recent *_io_concurrency default
change (1->16).  It's actually pretty hard to drive a laptop up to
needing more than half a dozen or a dozen or so workers with this
patch, especially without debug_io_direct=data, ie with fast
double-buffered I/O; but cloud environments may be where most
databases run these days, and low-end cloud configurations have
arbitrary made-up limits that may be pretty low, so it all depends....
I really don't know, but one idea is that we could leave it as open as
possible, and let users worry about that with higher-level settings
and the query concurrency they choose to generate...
io_method=io_uring is effectively open, so why should io_method=worker
be any different by default?  Just some thoughts.  I'm not sure.