Re: Automatically sizing the IO worker pool - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: Automatically sizing the IO worker pool
Date:
Msg-id: CA+hUKG+P80oR7yy5-67uHqwBWr9rux69BkQF9fwz=f0LuDn3Rw@mail.gmail.com
In response to: Re: Automatically sizing the IO worker pool (Dmitry Dolgov <9erthalion6@gmail.com>)
List: pgsql-hackers
On Wed, Jul 30, 2025 at 10:15 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Thanks. I was experimenting with this approach, and realized there isn't
> much metrics exposed about workers and the IO queue so far. Since the

Hmm. You can almost infer the depth from the pg_aios view. All IOs in use are visible there, and the SUBMITTED ones are all either in the queue, currently being executed by a worker, or being executed synchronously by a regular backend because the queue was full and submission fell back to synchronous execution. Perhaps we just need to be able to distinguish those three cases in that view. For the synchronous-in-submitter overflow case, I think f_sync should really show 't', and I'll post a patch for that shortly. For "currently executing in a worker", I wonder if we could have an "info" column that queries a new optional callback pgaio_iomethod_ops->get_info(ioh), where worker mode could return "worker 3", or something like that.

> worker pool growth is based on the queue size and workers try to share
> the load uniformly, it makes to have a system view to show those

Actually it's not uniform: it tries to wake the lowest-numbered worker that advertises itself as idle in that little bitmap of idle workers. So if you look in htop you'll see that worker 0 is the busiest, then worker 1, etc. Only if they are all quite busy does it become almost uniform, which probably implies you've hit io_max_workers and should set it higher (or, without this patch, that you should probably just increase io_workers manually, assuming your I/O hardware can take more). Originally I made it like that to give higher-numbered workers a chance to time out (anticipating this patch). Later I found another reason to do it that way: when I tried uniform distribution using atomic_fetch_add(&distributor, 1) % nworkers to select the worker to wake up, avg(latency) and stddev(latency) were both higher for simple tests like the one attached to the first message, when running several copies of it concurrently. The concentrate-into-lowest-numbers design benefits from latch collapsing and allows the busier workers to avoid going back to sleep when they could immediately pick up a new job. I didn't change that in this patch, though I did tweak the "fan out" logic a bit after some experimentation on several machines, where I realised the code in master/18 is a bit over-enthusiastic about that and has a higher spurious wakeup ratio (something this patch actually measures and tries to reduce).

Here is one of my less successful attempts at a round-robin system that tries to adjust the pool size with more engineering, but it was consistently worse on those latency statistics than this approach, and wasn't even as good at finding a good pool size, so eventually I realised it was a dead end and my original work-concentrating concept was better:

https://github.com/macdice/postgres/tree/io-worker-pool

FWIW, the patch from this thread is in this public branch:

https://github.com/macdice/postgres/tree/io-worker-pool-3
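To make the contrast concrete, here is a minimal standalone sketch of the two selection strategies, just as an illustration (invented names, not the code from either branch; assumes at most 32 workers and GCC/Clang builtins):

/*
 * Sketch only: contrast the uniform round-robin wakeup with the
 * concentrate-into-lowest-numbers wakeup discussed above.
 */
#include <stdatomic.h>
#include <stdint.h>

/* Uniform round-robin: spread wakeups evenly across all workers. */
static int
choose_worker_round_robin(_Atomic uint32_t *distributor, int nworkers)
{
    return atomic_fetch_add(distributor, 1) % nworkers;
}

/*
 * Concentrate: wake the lowest-numbered worker whose bit is set in the idle
 * bitmap, so higher-numbered workers stay idle and get a chance to time out.
 */
static int
choose_worker_lowest_idle(uint32_t idle_bitmap)
{
    if (idle_bitmap == 0)
        return -1;                          /* everyone is busy; nothing to wake */
    return __builtin_ctz(idle_bitmap);      /* index of lowest set bit */
}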
> Regarding the worker pool growth approach, it sounds reasonable to me.

Great to hear. I wonder what other kinds of testing we should do to validate this, but I am feeling quite confident about this patch and thinking it should probably go in sooner rather than later.

> With static number of workers one needs to somehow find a number
> suitable for all types of workload, where with this patch one needs only
> to fiddle with the launch interval to handle possible spikes. It would
> be interesting to investigate, how this approach would react to
> different dynamics of the queue size. I've plotted one "spike" scenario
> in the "Worker pool size response to queue depth", where there is a
> pretty artificial burst of IO, making the queue size look like a step
> function. If I understand the patch implementation correctly, it would
> respond linearly over time (green line), one could also think about
> applying a first order butterworth low pass filter to respond quicker
> but still smooth (orange line).

Interesting. There is only one kind of smoothing in the patch currently, relating to the pool size going down. It models spurious latch wakeups as an exponentially decaying ratio of wakeups:work. That's the only way I could find to deal with the inherent sloppiness of the wakeup mechanism with a shared queue: when you wake the lowest-numbered idle worker as of some moment in time, it might lose the race against an even lower-numbered worker that finishes its current job and steals the new one. When workers steal jobs, latency decreases, which is good, so instead of preventing it I eventually figured out that we should measure it, smooth it, and use it to limit wakeup propagation (roughly as sketched below). I wonder if that naturally produces curves a bit like your butterworth line when the pool size is already going down, but I'm not sure.

As for the curve on the way up, hmm, I'm not sure. Yes, it goes up linearly and is limited by the launch delay, but I was thinking of that only as the way it grows when the *variation* in workload changes over a long time frame. In other words, maybe it's not so important how exactly it grows; it's more important that it achieves a steady state that can handle the oscillations and spikes in your workload. The idle timeout creates that steady state by holding the current pool size for quite a while, so that it can handle your quieter and busier moments immediately without having to adjust the pool size. In that other failed attempt I tried to model that more explicitly, with "active" workers and "spare" workers, the active set sized for average demand with uniform wakeups and the spare set sized for some number of standard deviations and woken up only when the queue is high, but I could never really make it work well...
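For illustration only, the decaying wakeups:work ratio amounts to something like this (invented names and constants, not the patch's actual code):

/*
 * Sketch only: an exponentially decaying estimate of how many recent wakeups
 * were spurious, used to decide whether to fan out further wakeups.
 */
#include <stdbool.h>

#define DECAY 0.9               /* closer to 1.0 = smoother, slower to adapt */
#define FAN_OUT_THRESHOLD 0.25  /* arbitrary cut-off for this illustration */

static double spurious_ratio = 0.0;

static void
record_wakeup(bool found_work)
{
    /* Fold this wakeup into the decaying average: 1.0 = spurious, 0.0 = useful. */
    spurious_ratio = DECAY * spurious_ratio +
        (1.0 - DECAY) * (found_work ? 0.0 : 1.0);
}

static bool
should_propagate_wakeup(void)
{
    /* Only wake another idle worker if few recent wakeups were wasted. */
    return spurious_ratio < FAN_OUT_THRESHOLD;
}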
> But in reality the queue size would be of course much more volatile even
> on stable workloads, like in "Queue depth over time" (one can see
> general oscillation, as well as different modes, e.g. where data is in
> the page cache vs where it isn't). Event more, there is a feedback where
> increasing number of workers would accelerate queue size decrease --
> based on [1] the system utilization for M/M/k depends on the arrival
> rate, processing rate and number of processors, where pretty intuitively
> more processors reduce utilization. But alas, as you've mentioned this
> result exists for Poisson distribution only.

> Btw, I assume something similar could be done to other methods as well?
> I'm not up to date on io uring, can one change the ring depth on the
> fly?

Each backend's io_uring submission queue is configured at startup and not changeable later, but it is sized for the maximum possible number of IOs that each backend can submit, io_max_concurrency, which corresponds to the backend's portion of the fixed array of PgAioHandle objects. I suppose you could say that each backend's submission queue can't overflow at that level, because it's perfectly sized and not shared with other backends; or, to put it another way, the equivalent of overflow is that we won't try to submit more IOs than that. Worker mode has a shared submission queue, but falls back to synchronous execution if it's full, which is a bit weird as it makes your IOs jump the queue in a sense, and that is a good reason to want this patch, so that the pool can try to find the size that avoids that instead of leaving the user in the dark.

As for the equivalent of pool sizing inside io_uring (and maybe other AIO systems in other kernels), hmm... in the absolute best cases worker threads can be skipped completely, e.g. for direct I/O queued straight to the device, but when they are used, I guess they have pretty different economics. A kernel can start a thread just by allocating a bit of memory and sticking it in a queue, and can also wake threads (move them to a different scheduler queue) cheaply, but we have to fork a giant process that has to open all the files and build up its caches etc. So I think they just start threads immediately on demand without damping, with some kind of short grace period just to avoid those smaller costs being repeated. I'm no expert on those internal details, but our worker system clearly needs all this damping and these steady-state discovery heuristics due to the higher overheads and sloppy wakeups.

Thinking more about our comparatively heavyweight I/O workers, there must also be affinity opportunities. If you somehow tended to use the same workers for a given database in a cluster with multiple active databases, then workers might accumulate fewer open file descriptors and SMgrRelation cache objects. If you had per-NUMA-node pools and queues then you might be able to reduce contention, and maybe also cache line ping-pong on buffer headers, considering that the submitter dirties the header, then the worker does (in the completion callback), and then the submitter accesses it again. I haven't investigated that.

> As a side note, I was trying to experiment with this patch using
> dm-mapper's delay feature to introduce an arbitrary large io latency and
> see how the io queue is growing. But strangely enough, even though the
> pure io latency was high, the queue growth was smaller than e.g. on a
> real hardware under the same conditions without any artificial delay. Is
> there anything obvious I'm missing that could have explained that?

Could it be alternating between full and almost empty due to method_worker.c's fallback to synchronous execution on overflow, which slows the submission down, or something like that, so that you're plotting an average depth that is lower than you expected? With the patch I'll share shortly to make pg_aios show a useful f_sync value, it might be more obvious...
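The overflow fallback mentioned above amounts to something like this self-contained sketch (pthreads standing in for PostgreSQL primitives, invented names, not the actual method_worker.c code):

/*
 * Sketch only: a bounded shared queue where a full queue makes the submitter
 * perform the IO synchronously itself, "jumping the queue".
 */
#include <stdbool.h>
#include <pthread.h>

#define QUEUE_CAPACITY 64

typedef struct IoRequest IoRequest;

extern void wake_one_idle_worker(void);                 /* assumed to exist elsewhere */
extern void perform_io_synchronously(IoRequest *io);    /* assumed to exist elsewhere */

static IoRequest *queue[QUEUE_CAPACITY];
static int queue_depth;
static int queue_tail;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns true if the IO was handed to the worker pool, false if it ran synchronously. */
static bool
submit_io(IoRequest *io)
{
    bool queued = false;

    pthread_mutex_lock(&queue_lock);
    if (queue_depth < QUEUE_CAPACITY)
    {
        queue[queue_tail] = io;
        queue_tail = (queue_tail + 1) % QUEUE_CAPACITY;
        queue_depth++;
        queued = true;
    }
    pthread_mutex_unlock(&queue_lock);

    if (queued)
        wake_one_idle_worker();         /* concentrate on the lowest-numbered idle worker */
    else
        perform_io_synchronously(io);   /* overflow: fall back to synchronous execution */

    return queued;
}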
About dm-mapper delays: I actually found it useful to hack up worker mode itself to simulate storage behaviours, for example swamped local disks, or cloud storage with deep queues and no back pressure but artificial IOPS and bandwidth caps, etc.

I was thinking about developing some proper settings to help with that kind of research: debug_io_worker_queue_size (changeable at runtime), debug_io_max_worker_queue_size (allocated at startup), debug_io_worker_{latency,bandwidth,iops} to introduce calculated sleeps, and debug_io_worker_overflow_policy=synchronous|wait so that you can disable the synchronous fallback that confuses matters. That'd be more convenient, portable and flexible than dm-mapper tricks, I guess. I'd been imagining that as a tool to investigate higher-level work on feedback control for read_stream.c as mentioned, but come to think of it, it could also be useful for understanding things about the worker pool itself. That's vapourware though; for myself I just used dirty hacks last time I was working on that stuff. In other words, patches are most welcome if you're interested in that kind of thing. I am a bit tied up with multithreading at the moment and time grows short. I will come back to that problem in a little while, and that patch is on my list as part of the infrastructure needed to prove things about the I/O stream feedback work I hope to share later...
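Just to illustrate what I mean by "calculated sleeps", roughly (the variables stand in for the hypothetical debug_io_worker_{latency,bandwidth,iops} settings; none of this exists today):

/*
 * Sketch only: how long a worker could sleep before completing an IO of a
 * given size, given simulated latency, IOPS and bandwidth limits.
 */
#include <stddef.h>

static double sim_latency_us = 100.0;               /* fixed per-IO latency to add */
static double sim_iops = 10000.0;                   /* simulated IOPS cap */
static double sim_bandwidth_bytes_per_sec = 200e6;  /* simulated bandwidth cap */

static double
simulated_io_sleep_us(size_t nbytes)
{
    double by_iops = 1e6 / sim_iops;
    double by_bandwidth = (double) nbytes / sim_bandwidth_bytes_per_sec * 1e6;
    double throughput_cap = (by_iops > by_bandwidth) ? by_iops : by_bandwidth;

    /* Fixed latency plus whichever throughput limit dominates for this IO. */
    return sim_latency_us + throughput_cap;
}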