Re: Asynchronous and "direct" IO support for PostgreSQL. - Mailing list pgsql-hackers
From | Dmitry Dolgov
Subject | Re: Asynchronous and "direct" IO support for PostgreSQL.
Date |
Msg-id | 20210402160636.ymdln4pzrt6kuza3@localhost
In response to | Re: Asynchronous and "direct" IO support for PostgreSQL. (Dmitry Dolgov <9erthalion6@gmail.com>)
List | pgsql-hackers
Sorry for another late reply, finally found some time to formulate a couple of thoughts.

> On Thu, Feb 25, 2021 at 09:22:43AM +0100, Dmitry Dolgov wrote:
> > On Wed, Feb 24, 2021 at 01:45:10PM -0800, Andres Freund wrote:
> >
> > > I'm curious if it makes sense
> > > to explore possibility to have these sort of "backpressure", e.g. if
> > > number of inflight requests is too large calculate inflight_limit a bit
> > > lower than possible (to avoid hard performance deterioration when the db
> > > is trying to do too much IO, and rather do it smooth).
> >
> > What I do think is needed and feasible (there's a bunch of TODOs in the
> > code about it already) is to be better at only utilizing deeper queues
> > when lower queues don't suffice. So we e.g. don't read ahead more than a
> > few blocks for a scan where the query is spending most of the time
> > "elsewhere".
> >
> > There's definitely also some need for a bit better global, instead of
> > per-backend, control over the number of IOs in flight. That's not too
> > hard to implement - the hardest probably is to avoid it becoming a
> > scalability issue.
> >
> > I think the area with the most need for improvement is figuring out how
> > we determine the queue depths for different things using IO. Don't
> > really want to end up with 30 parameters influencing what queue depth to
> > use for (vacuum, index builds, sequential scans, index scans, bitmap
> > heap scans, ...) - but they benefit from a deeper queue will differ
> > between places.

Talking about parameters, from what I understand the actual number of queues (e.g. io_urings) created is specified by PGAIO_NUM_CONTEXTS - shouldn't it be configurable? Maybe in fact there should not be that many knobs after all, if the model assumes the storage has:

* Some number of hardware queues - then the number of queues the AIO implementation needs to use depends on it. For example, lowering the number of contexts between different benchmark runs I could see that some of the hardware queues were significantly underutilized. Potentially there could also be such a thing as too many contexts.

* A certain bandwidth - then the submit batch size (io_max_concurrency or PGAIO_SUBMIT_BATCH_SIZE) depends on it. This would make it possible to distinguish attached storage with high bandwidth and high latency from local storage.

From what I see, max_aio_in_flight is used as the queue depth for contexts, which is workload dependent and not easy to figure out, as you mentioned. To avoid having 30 different parameters, maybe it's more feasible to introduce "shallow" and "deep" queues, where the particular depth for those could be derived from the depth of the hardware queues. The question of which activity should use which queue is not easy, but if I get it right from queueing theory (assuming IO producers are stationary processes and a fixed IO latency from the storage), it depends on the IO arrival distribution in every particular case, and this in turn could be roughly estimated for each type of activity. One can expect different IO arrival distributions for e.g. a normal point-query backend and a checkpoint or vacuum process, no matter what the other conditions are (collecting those for a few benchmark runs indeed gives pretty distinct distributions).

If I understand correctly, those contexts defined by PGAIO_NUM_CONTEXTS are the main workhorse, right? I'm asking because there is also something called local_ring, but it seems no IOs are submitted into those.
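Just to make concrete what I mean by a "context" above, here is a rough standalone liburing sketch (plain liburing, not the pgaio code from the patchset; NUM_CONTEXTS, QUEUE_DEPTH and the round-robin placement are arbitrary illustration values): a fixed pool of independent rings, each with its own queue depth, with reads spread across them.

/*
 * Rough standalone sketch, not the patchset code: a fixed pool of
 * independent io_uring instances ("contexts"), each with its own queue
 * depth, with reads distributed round-robin across them.  All constants
 * below are made-up illustration values.
 */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NUM_CONTEXTS 4
#define QUEUE_DEPTH  32
#define BLOCK_SIZE   8192
#define NUM_BLOCKS   16

int
main(int argc, char **argv)
{
    struct io_uring rings[NUM_CONTEXTS];
    char *buffers[NUM_BLOCKS];
    int fd, i, ret;

    if (argc < 2)
    {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    if ((fd = open(argv[1], O_RDONLY)) < 0)
    {
        perror("open");
        return 1;
    }

    /* one independent submission queue ("context") per ring */
    for (i = 0; i < NUM_CONTEXTS; i++)
    {
        ret = io_uring_queue_init(QUEUE_DEPTH, &rings[i], 0);
        if (ret < 0)
        {
            fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret));
            return 1;
        }
    }

    /* spread the reads round-robin over the contexts */
    for (i = 0; i < NUM_BLOCKS; i++)
    {
        struct io_uring *ring = &rings[i % NUM_CONTEXTS];
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        buffers[i] = malloc(BLOCK_SIZE);
        io_uring_prep_read(sqe, fd, buffers[i], BLOCK_SIZE,
                           (__u64) i * BLOCK_SIZE);
    }

    /* one submission syscall per context, then reap all completions */
    for (i = 0; i < NUM_CONTEXTS; i++)
        io_uring_submit(&rings[i]);

    for (i = 0; i < NUM_BLOCKS; i++)
    {
        struct io_uring *ring = &rings[i % NUM_CONTEXTS];
        struct io_uring_cqe *cqe;

        if (io_uring_wait_cqe(ring, &cqe) == 0)
            io_uring_cqe_seen(ring, cqe);
    }

    for (i = 0; i < NUM_CONTEXTS; i++)
        io_uring_queue_exit(&rings[i]);
    close(fd);
    return 0;
}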
Assuming that contexts are the main way of submitting IO, it would also be interesting to explore contexts isolated for different purposes. I haven't finished my changes here yet, so I can't give any results, but at least some tests with fio show different latencies when two io_urings are processing mixed reads/writes vs. isolated reads or writes. On a side note, at the end of the day there are so many queues - application queue, io_uring, mq software queue, hardware queue - that I'm really curious whether this amplifies tail latencies.

Another thing I've noticed is that the AIO implementation is much more affected by side IO activity than the synchronous one. E.g. the AIO version's tps drops from tens of thousands to a couple of hundred just because some kworker started to flush dirty buffers (especially with writeback throttling disabled), while the synchronous version doesn't suffer that much. Not sure what to make of it.

Btw, overall I've managed to get better numbers from the AIO implementation on IO bound test cases with a local NVME device, but non IO bound ones were mostly a bit slower - is that expected, or am I missing something?

An interesting thing to note is that io_uring apparently relaxed the requirements for polling operations: now one needs only the CAP_SYS_NICE capability, not CAP_SYS_ADMIN. I guess theoretically there are no issues with using it within the current design?
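For reference, here is a minimal sketch of the polling mode I mean above (plain liburing, not the patchset code; queue depth and idle time are arbitrary illustration values): a ring set up with IORING_SETUP_SQPOLL, which is the part that on recent kernels needs only CAP_SYS_NICE.

/*
 * Rough standalone sketch: set up an io_uring instance with
 * IORING_SETUP_SQPOLL so a kernel thread polls the submission queue.
 * All values are arbitrary illustration values.
 */
#include <liburing.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    struct io_uring ring;
    struct io_uring_params params;
    int ret;

    memset(&params, 0, sizeof(params));
    params.flags = IORING_SETUP_SQPOLL;   /* kernel-side SQ polling */
    params.sq_thread_idle = 2000;         /* ms before the poller sleeps */

    ret = io_uring_queue_init_params(64, &ring, &params);
    if (ret < 0)
    {
        /* EPERM here would mean the capability check did not pass */
        fprintf(stderr, "io_uring_queue_init_params: %s\n", strerror(-ret));
        return 1;
    }

    /*
     * Submissions from here on work as usual; liburing only wakes the
     * poller thread up when it has gone idle.
     */
    io_uring_queue_exit(&ring);
    return 0;
}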