Hi all
A new kernel API called io_uring has recently come to my attention. I assume some of you (Andres?) have been following it for a while.
io_uring appears to offer a way to make system calls including reads, writes, fsync()s, and more in a non-blocking, batched and pipelined manner, with or without O_DIRECT. Basically async I/O with usable buffered I/O and fsync support. It has ordering support which is really important for us.
This should be on our radar. The main barriers to benefiting from linux-aio based async I/O in postgres in the past have been its reliance on direct I/O, the various kernel-version quirks, platform portability, and its maybe-async-except-when-it's-randomly-not nature.
Kernel version requirements and portability remain issues with io_uring, so it's not something we can pivot over to completely. But we should probably take a closer look at it.
PostgreSQL spends a huge amount of time waiting, doing nothing, for blocking I/O. If we can improve that then we could potentially realize some major increases in I/O utilization especially for bigger, less concurrent workloads. The most obvious candidates to benefit would be redo, logical apply, and bulk loading.
But I have no idea how to even begin to fit this into PostgreSQL's executor pipeline. Almost all of PostgreSQL's code is synchronous-blocking-imperative in nature, with a push/pull executor pipeline. It seems to have been recognised for some time that this is increasingly hurting our performance and scalability as platforms become more and more parallel.
To benefit from AIO (be it POSIX, linux-aio, io_uring, Windows AIO, etc.) we have to be able to dispatch I/O and do something else while we wait for the results. So we need the ability to pipeline the executor and pipeline redo.
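To make the "dispatch, do something else, then collect" shape concrete, here is a rough sketch in Python. It is not io_uring and not PostgreSQL code: worker threads stand in for kernel-level async I/O, and the names read_blocks_overlapped, other_work and BLOCK_SIZE are made up for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 8192  # assumption: PostgreSQL's default page size, used only for illustration


def read_blocks_overlapped(path, block_numbers, other_work):
    """Dispatch reads for block_numbers, run other_work() while they are
    in flight, then collect the results in the order they were requested."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=4) as pool:
            # Dispatch: each pread is handed to a worker thread, so submit()
            # returns immediately instead of blocking on the read.
            futures = [
                pool.submit(os.pread, fd, BLOCK_SIZE, n * BLOCK_SIZE)
                for n in block_numbers
            ]
            # Do something useful while the reads are outstanding.
            side_result = other_work()
            # Reap completions in submission order; this is the only
            # point where the caller may have to wait.
            blocks = [f.result() for f in futures]
    finally:
        os.close(fd)
    return blocks, side_result
```

The point is only the control flow: the caller has to be structured so that it has something else to do between submission and completion, which is exactly what most of our code is not set up for today.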
I thought I'd start the discussion on this and see where we can go with it. What incremental steps can be done to move us toward parallelisable I/O without having to redesign everything?
I'm thinking that redo is probably a good first candidate. It doesn't depend on the guts of the executor. It is much less sensitive to ordering between operations in shmem and on disk since it runs in the startup process. And it suffers REALLY BADLY from its single-threaded blocking approach to I/O - as shown by an extension written by 2ndQuadrant that can double redo performance by doing read-ahead on btree pages that will soon be needed.
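As a rough illustration of the read-ahead idea (not the extension's actual implementation, which I'm not describing here), the sketch below hints the kernel about blocks that upcoming records will touch before replaying the current one. It assumes the list of upcoming block numbers is already known, e.g. from scanning ahead in the WAL. replay_with_readahead, apply_block and window are hypothetical names; os.posix_fadvise with POSIX_FADV_WILLNEED is the only real prefetch mechanism used.

```python
import os

BLOCK_SIZE = 8192  # assumption: PostgreSQL's default page size, used only for illustration


def replay_with_readahead(fd, block_numbers, apply_block, window=8):
    """Apply blocks in order while keeping prefetch hints up to `window`
    entries ahead of the block currently being applied."""
    hinted = 0  # index of the next entry that has not been hinted yet
    applied = []
    for i, blkno in enumerate(block_numbers):
        # Top up the hints so they cover entries i .. i + window.
        while hinted < len(block_numbers) and hinted <= i + window:
            try:
                os.posix_fadvise(
                    fd,
                    block_numbers[hinted] * BLOCK_SIZE,
                    BLOCK_SIZE,
                    os.POSIX_FADV_WILLNEED,
                )
            except OSError:
                pass  # the hint is advisory; replay is still correct without it
            hinted += 1
        # Still a blocking read, but ideally served from the page cache now.
        data = os.pread(fd, BLOCK_SIZE, blkno * BLOCK_SIZE)
        applied.append(apply_block(blkno, data))
    return applied
```

This keeps replay strictly ordered and single-threaded. Only the I/O wait is overlapped, which is why redo looks like a tractable first step.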
Thoughts anybody?