Re: Blocking I/O, async I/O and io_uring - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Blocking I/O, async I/O and io_uring
Date
Msg-id 20201208040227.7rlzpfpdxoau4pvd@alap3.anarazel.de
In response to Blocking I/O, async I/O and io_uring  (Craig Ringer <craig.ringer@enterprisedb.com>)
Responses RE: Blocking I/O, async I/O and io_uring
Re: Blocking I/O, async I/O and io_uring
List pgsql-hackers
Hi,

On 2020-12-08 10:55:37 +0800, Craig Ringer wrote:
> A new kernel API called io_uring has recently come to my attention. I
> assume some of you (Andres?) have been following it for a while.

Yea, I've spent a *lot* of time working on AIO support, utilizing
io_uring. Recently Thomas also joined in the fun. I've given two talks
referencing it (last pgcon, last pgday brussels), but otherwise I've not
yet written much about it. Things aren't *quite* right yet architecturally,
but I think we're getting there.

Thomas is working on making the AIO infrastructure portable (a worker
based fallback, posix AIO support for freebsd & OSX). Once that's done,
and some of the architectural things are resolved, I plan to write a long
email about what I think the right design is, and where I am at.

The current state is at https://github.com/anarazel/postgres/tree/aio
(but it's not a very clean history at the moment).

There's currently no windows AIO support, but it shouldn't be too hard
to add. My preliminary look indicates that we'd likely have to use
overlapped IO with WaitForMultipleObjects(), not IOCP, since we need to
be able to handle latches etc, which seems harder with IOCP. But perhaps
we can do something using the signal handling emulation posting events
onto IOCP instead.


> io_uring appears to offer a way to make system calls including reads,
> writes, fsync()s, and more in a non-blocking, batched and pipelined manner,
> with or without O_DIRECT. Basically async I/O with usable buffered I/O and
> fsync support. It has ordering support which is really important for us.

My results indicate that we really want to have O_DIRECT support
(optional & not enabled by default, of course). We just can't benefit
fully from modern SSDs otherwise. Buffered is also important, of course.


> But I have no idea how to even begin to fit this into PostgreSQL's executor
> pipeline. Almost all PostgreSQL's code is synchronous-blocking-imperative
> in nature, with a push/pull executor pipeline. It seems to have been
> recognised for some time that this is increasingly hurting our performance
> and scalability as platforms become more and more parallel.

> To benefit from AIO (be it POSIX, linux-aio, io_uring, Windows AIO, etc) we
> have to be able to dispatch I/O and do something else while we wait for the
> results. So we need the ability to pipeline the executor and pipeline redo.

> I thought I'd start the discussion on this and see where we can go with it.
> What incremental steps can be done to move us toward parallelisable I/O
> without having to redesign everything?

I'm pretty sure that I've got the basics of this working pretty well. I
don't think the executor architecture is as big an issue as you seem to
think. There are further benefits that could be unlocked if we had a
more flexible executor model (imagine switching between different parts
of the query whenever blocked on IO - can't do that due to the stack
right now).

The way it currently works is that things like sequential scans, vacuum,
etc use a prefetching helper which will try to use AIO to read ahead of
the next needed block. That helper uses callbacks to determine the next
needed block, which e.g. vacuum uses to skip over all-visible/frozen
blocks. There are plenty of other places that should use that helper, but
we can already get considerably higher throughput for seqscans and vacuum,
on both very fast local storage and high-latency cloud storage.
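
As a rough illustration only (this is not the code in the aio branch, and
every name below is made up for the example), the callback-driven shape
described above could look something like this. The helper keeps a fixed
number of reads in flight ahead of the consumer and asks a callback which
block comes next. The "start read" step is a stub that just counts, where a
real implementation would submit an asynchronous read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INVALID_BLOCK UINT32_MAX
#define MAX_DISTANCE  16

/* Returns the next block the scan wants, or INVALID_BLOCK when done. */
typedef uint32_t (*next_block_cb) (void *state);
/* Starts an asynchronous read of blkno (only a stub in this sketch). */
typedef void (*start_read_cb) (uint32_t blkno, void *state);

typedef struct Prefetcher
{
    next_block_cb next_block;
    start_read_cb start_read;
    void       *state;
    uint32_t    inflight[MAX_DISTANCE];  /* ring buffer, oldest first */
    int         head;
    int         count;
    int         distance;   /* number of reads to keep in flight */
    bool        exhausted;  /* callback has reported end of scan */
} Prefetcher;

static void
prefetcher_init(Prefetcher *p, int distance, next_block_cb next_block,
                start_read_cb start_read, void *state)
{
    assert(distance >= 1 && distance <= MAX_DISTANCE);
    p->next_block = next_block;
    p->start_read = start_read;
    p->state = state;
    p->head = 0;
    p->count = 0;
    p->distance = distance;
    p->exhausted = false;
}

/* Returns the next block to process, or INVALID_BLOCK at end of scan. */
static uint32_t
prefetcher_next(Prefetcher *p)
{
    /* Top up the queue so that `distance` reads stay in flight. */
    while (!p->exhausted && p->count < p->distance)
    {
        uint32_t    blkno = p->next_block(p->state);

        if (blkno == INVALID_BLOCK)
        {
            p->exhausted = true;
            break;
        }
        p->start_read(blkno, p->state);
        p->inflight[(p->head + p->count) % MAX_DISTANCE] = blkno;
        p->count++;
    }

    if (p->count == 0)
        return INVALID_BLOCK;

    /* Hand out the oldest block; its read was started the earliest. */
    uint32_t    blkno = p->inflight[p->head];

    p->head = (p->head + 1) % MAX_DISTANCE;
    p->count--;
    return blkno;
}

/* Hypothetical vacuum-like scan state: skips blocks marked all-visible. */
typedef struct VacuumScan
{
    const bool *all_visible;
    uint32_t    nblocks;
    uint32_t    next;
    uint32_t    reads_started;
} VacuumScan;

static uint32_t
vacuum_next_block(void *arg)
{
    VacuumScan *scan = arg;

    while (scan->next < scan->nblocks && scan->all_visible[scan->next])
        scan->next++;
    return scan->next < scan->nblocks ? scan->next++ : INVALID_BLOCK;
}

static void
vacuum_start_read(uint32_t blkno, void *arg)
{
    VacuumScan *scan = arg;

    (void) blkno;           /* a real helper would submit the read here */
    scan->reads_started++;
}
```

The point of the callback is that the scan logic (which blocks to skip) stays
with the caller, while the read-ahead distance stays with the helper.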

Similarly, for writes there's a small helper to manage a write-queue of
configurable depth, which currently is used by checkpointer and
bgwriter (but should be used in more places). Especially with direct IO,
checkpointing can be a lot faster *and* less impactful on the "regular"
load.
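
To make the "write-queue of configurable depth" idea concrete, here is a
minimal sketch under stated assumptions: the names are invented for the
example, and the submit/wait callbacks are stubs standing in for real
asynchronous write submission and completion. It is not the helper from the
aio branch. The queue submits writes until `depth` are in flight, and from
then on waits for the oldest one before submitting another.

```c
#include <assert.h>
#include <stdint.h>

#define WQ_MAX_DEPTH 64

/* Hypothetical hooks: start a write, and wait for a started write. */
typedef void (*wq_submit_cb) (uint32_t blkno, void *state);
typedef void (*wq_wait_cb) (uint32_t blkno, void *state);

typedef struct WriteQueue
{
    uint32_t    inflight[WQ_MAX_DEPTH];  /* ring buffer, oldest first */
    int         head;
    int         count;
    int         depth;      /* configurable limit on in-flight writes */
    wq_submit_cb submit;
    wq_wait_cb  wait;
    void       *state;
} WriteQueue;

static void
write_queue_init(WriteQueue *q, int depth, wq_submit_cb submit,
                 wq_wait_cb wait, void *state)
{
    assert(depth >= 1 && depth <= WQ_MAX_DEPTH);
    q->head = 0;
    q->count = 0;
    q->depth = depth;
    q->submit = submit;
    q->wait = wait;
    q->state = state;
}

static void
write_queue_wait_oldest(WriteQueue *q)
{
    assert(q->count > 0);
    q->wait(q->inflight[q->head], q->state);
    q->head = (q->head + 1) % WQ_MAX_DEPTH;
    q->count--;
}

/* Queue one write, blocking on the oldest one only when the queue is full. */
static void
write_queue_push(WriteQueue *q, uint32_t blkno)
{
    if (q->count == q->depth)
        write_queue_wait_oldest(q);
    q->submit(blkno, q->state);
    q->inflight[(q->head + q->count) % WQ_MAX_DEPTH] = blkno;
    q->count++;
}

/* Wait for everything still in flight, e.g. at the end of a checkpoint. */
static void
write_queue_drain(WriteQueue *q)
{
    while (q->count > 0)
        write_queue_wait_oldest(q);
}

/* Stub callbacks that only keep counters, for demonstration. */
typedef struct WriteStats
{
    int         submitted;
    int         completed;
    int         peak_inflight;
} WriteStats;

static void
stub_submit(uint32_t blkno, void *arg)
{
    WriteStats *s = arg;

    (void) blkno;
    s->submitted++;
    if (s->submitted - s->completed > s->peak_inflight)
        s->peak_inflight = s->submitted - s->completed;
}

static void
stub_wait(uint32_t blkno, void *arg)
{
    WriteStats *s = arg;

    (void) blkno;
    s->completed++;
}
```

The depth bounds how much write IO one process can have outstanding, which
is the knob that trades throughput against impact on other load.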

I've got asynchronous writing of WAL mostly working, but need to
redesign the locking a bit further. Right now it's a win in some cases,
but not others. The latter is to a significant degree due to unnecessary
blocking....


> I'm thinking that redo is probably a good first candidate. It doesn't
> depend on the guts of the executor. It is much less sensitive to
> ordering between operations in shmem and on disk since it runs in the
> startup process. And it hurts REALLY BADLY from its single-threaded
> blocking approach to I/O - as shown by an extension written by
> 2ndQuadrant that can double redo performance by doing read-ahead on
> btree pages that will soon be needed.

Thomas has a patch for prefetching during WAL apply. It currently uses
posix_fadvise(), but he took care that it'd be fairly easy to rebase it
onto "real" AIO. Most of the changes necessary are pretty independent of
posix_fadvise vs aio.
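
For readers unfamiliar with the posix_fadvise() approach, a minimal sketch of
the general technique follows. It is not taken from Thomas's patch: the
function names are hypothetical, and BLCKSZ is assumed to be the default 8kB
block size. The idea is to hint the kernel about blocks that will be needed
soon, and do the ordinary blocking read later, by which time the data is
hopefully in the page cache.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192             /* assumed block size */

/*
 * Hint that the given blocks will be needed soon. Returns the number of
 * hints the kernel accepted. Note that posix_fadvise() returns an error
 * number directly rather than setting errno.
 */
static int
prefetch_blocks(int fd, const uint32_t *blocks, int nblocks)
{
    int         accepted = 0;

    for (int i = 0; i < nblocks; i++)
    {
        off_t       offset = (off_t) blocks[i] * BLCKSZ;
        int         rc = posix_fadvise(fd, offset, BLCKSZ,
                                       POSIX_FADV_WILLNEED);

        if (rc == 0)
            accepted++;
        else
            fprintf(stderr, "posix_fadvise(block %u): %s\n",
                    (unsigned) blocks[i], strerror(rc));
    }
    return accepted;
}

/* The later, ordinary blocking read of one block. */
static ssize_t
read_block(int fd, uint32_t blkno, char *buf)
{
    return pread(fd, buf, BLCKSZ, (off_t) blkno * BLCKSZ);
}
```

The hint is advisory only: the read still blocks if the kernel has not
brought the data in yet, which is one reason "real" AIO is attractive.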

Greetings,

Andres Freund


