Re: Blocking I/O, async I/O and io_uring - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Blocking I/O, async I/O and io_uring |
Date | |
Msg-id | 20201208040227.7rlzpfpdxoau4pvd@alap3.anarazel.de |
In response to | Blocking I/O, async I/O and io_uring (Craig Ringer <craig.ringer@enterprisedb.com>) |
Responses | RE: Blocking I/O, async I/O and io_uring; Re: Blocking I/O, async I/O and io_uring |
List | pgsql-hackers |
Hi,

On 2020-12-08 10:55:37 +0800, Craig Ringer wrote:
> A new kernel API called io_uring has recently come to my attention. I
> assume some of you (Andres?) have been following it for a while.

Yea, I've spent a *lot* of time working on AIO support, utilizing io_uring. Recently Thomas also joined in the fun. I've given two talks referencing it (last pgcon, last pgday brussels), but otherwise I've not yet written much about it.

Things aren't *quite* right yet architecturally, but I think we're getting there.

Thomas is working on making the AIO infrastructure portable (a worker based fallback, posix AIO support for freebsd & OSX). Once that's done, and some of the architectural things are resolved, I plan to write a long email about what I think the right design is, and where I am at.

The current state is at https://github.com/anarazel/postgres/tree/aio (but it's not a very clean history at the moment).

There's currently no Windows AIO support, but it shouldn't be too hard to add. My preliminary look indicates that we'd likely have to use overlapped IO with WaitForMultipleObjects(), not IOCP, since we need to be able to handle latches etc, which seems harder with IOCP. But perhaps we can do something using the signal handling emulation posting events onto IOCP instead.

> io_uring appears to offer a way to make system calls including reads,
> writes, fsync()s, and more in a non-blocking, batched and pipelined manner,
> with or without O_DIRECT. Basically async I/O with usable buffered I/O and
> fsync support. It has ordering support which is really important for us.

My results indicate that we really want to have O_DIRECT support (optional and not enabled by default, of course). We just can't fully benefit from modern SSDs otherwise. Buffered is also important, of course.

> But I have no idea how to even begin to fit this into PostgreSQL's executor
> pipeline. Almost all PostgreSQL's code is synchronous-blocking-imperative
> in nature, with a push/pull executor pipeline.
> It seems to have been
> recognised for some time that this is increasingly hurting our performance
> and scalability as platforms become more and more parallel.
>
> To benefit from AIO (be it POSIX, linux-aio, io_uring, Windows AIO, etc) we
> have to be able to dispatch I/O and do something else while we wait for the
> results. So we need the ability to pipeline the executor and pipeline redo.
>
> I thought I'd start the discussion on this and see where we can go with it.
> What incremental steps can be done to move us toward parallelisable I/O
> without having to redesign everything?

I'm pretty sure that I've got the basics of this working pretty well. I don't think the executor architecture is as big an issue as you seem to think. There are further benefits that could be unlocked if we had a more flexible executor model (imagine switching between different parts of the query whenever blocked on IO - can't do that due to the stack right now).

The way it currently works is that things like sequential scans, vacuum, etc use a prefetching helper which will try to use AIO to read ahead of the next needed block. That helper uses callbacks to determine the next needed block, which e.g. vacuum uses to skip over all-visible/frozen blocks. There are plenty of other places that should use that helper, but we can already get considerably higher throughput for seqscans and vacuum on both very fast local storage and high-latency cloud storage.

Similarly, for writes there's a small helper to manage a write-queue of configurable depth, which is currently used by checkpointer and bgwriter (but should be used in more places). Especially with direct IO, checkpointing can be a lot faster *and* less impactful on the "regular" load.

I've got asynchronous writing of WAL mostly working, but need to redesign the locking a bit further. Right now it's a win in some cases, but not others. The latter to a significant degree due to unnecessary blocking...
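To make the shape of the callback-driven prefetching helper described above a bit more concrete, here is a minimal sketch. It is not code from the aio branch: every name in it (`Prefetcher`, `next_block_cb`, `VacuumScanState`, `INVALID_BLOCK`, `MAX_DEPTH`) is hypothetical, the visibility map is a plain `bool` array, and "starting a read" only records the block number in a ring where a real helper would submit an asynchronous read.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_BLOCK UINT32_MAX    /* hypothetical "no more blocks" marker */
#define MAX_DEPTH     16            /* hypothetical upper bound on queue depth */

/* Callback that yields the next block to read, or INVALID_BLOCK when done. */
typedef uint32_t (*next_block_cb) (void *state);

/* Illustrative vacuum-like scan state; all_visible stands in for a visibility map. */
typedef struct VacuumScanState
{
    const bool *all_visible;
    uint32_t    nblocks;
    uint32_t    next;
} VacuumScanState;

/* Skips blocks marked all-visible and returns the next block that needs work. */
uint32_t
vacuum_next_block(void *arg)
{
    VacuumScanState *s = arg;

    while (s->next < s->nblocks && s->all_visible[s->next])
        s->next++;
    if (s->next >= s->nblocks)
        return INVALID_BLOCK;
    return s->next++;
}

typedef struct Prefetcher
{
    next_block_cb next_block;
    void       *cb_state;
    uint32_t    inflight[MAX_DEPTH];    /* ring of blocks with reads "started" */
    int         depth;                  /* how far to read ahead */
    int         head;                   /* index of the oldest entry */
    int         count;                  /* number of entries in the ring */
    bool        exhausted;              /* callback has reported the end */
} Prefetcher;

void
prefetcher_init(Prefetcher *p, next_block_cb cb, void *state, int depth)
{
    if (depth < 1)
        depth = 1;
    if (depth > MAX_DEPTH)
        depth = MAX_DEPTH;
    p->next_block = cb;
    p->cb_state = state;
    p->depth = depth;
    p->head = 0;
    p->count = 0;
    p->exhausted = false;
}

/* Ask the callback for more blocks until `depth` reads are outstanding. */
void
prefetcher_fill(Prefetcher *p)
{
    while (!p->exhausted && p->count < p->depth)
    {
        uint32_t    blk = p->next_block(p->cb_state);

        if (blk == INVALID_BLOCK)
        {
            p->exhausted = true;
            break;
        }
        /* A real helper would start an asynchronous read of blk here. */
        p->inflight[(p->head + p->count) % p->depth] = blk;
        p->count++;
    }
}

/* Return the oldest outstanding block, then top the queue back up. */
uint32_t
prefetcher_get(Prefetcher *p)
{
    uint32_t    blk;

    prefetcher_fill(p);
    if (p->count == 0)
        return INVALID_BLOCK;
    blk = p->inflight[p->head];
    p->head = (p->head + 1) % p->depth;
    p->count--;
    prefetcher_fill(p);
    return blk;
}
```

A caller would loop on `prefetcher_get()` until it returns `INVALID_BLOCK`. The point of the callback is that the scan's own skipping logic decides what gets read ahead, so the helper never issues reads for blocks the consumer would skip anyway.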
> I'm thinking that redo is probably a good first candidate. It doesn't
> depend on the guts of the executor. It is much less sensitive to
> ordering between operations in shmem and on disk since it runs in the
> startup process. And it hurts REALLY BADLY from its single-threaded
> blocking approach to I/O - as shown by an extension written by
> 2ndQuadrant that can double redo performance by doing read-ahead on
> btree pages that will soon be needed.

Thomas has a patch for prefetching during WAL apply. It currently uses posix_fadvise(), but he took care that it'd be fairly easy to rebase it onto "real" AIO. Most of the changes necessary are pretty independent of posix_fadvise vs aio.

Greetings,

Andres Freund