Thread: Asynchronous and "direct" IO support for PostgreSQL.

Asynchronous and "direct" IO support for PostgreSQL.

From
Andres Freund
Date:
Hi,

over the last ~year I spent a lot of time trying to figure out how we could
add AIO (asynchronous IO) and DIO (direct IO) support to postgres. While
there's still a *lot* of open questions, I think I now have a decent handle on
most of the bigger architectural questions.  Thus this long email.


Just to be clear: I don't expect the current design to survive as-is. If
there are a few sentences below that sound a bit like describing the new world,
that's because they're from the README.md in the patch series...


## Why Direct / unbuffered IO?

The main reasons to want to use Direct IO are:

- Lower CPU usage / higher throughput. Particularly on modern storage,
  buffered writes are bottlenecked by the operating system having to
  copy data from the kernel's page cache to postgres' buffer pool using
  the CPU. Direct IO, in contrast, can often move the data directly
  between the storage device and postgres' buffer cache, using DMA. While
  that transfer is ongoing, the CPU is free to perform other work,
  with little impact.
- Avoiding double buffering between operating system cache and
  postgres' shared_buffers.
- Better control over the timing and pace of dirty data writeback.
- Potential for concurrent WAL writes (via O_DIRECT | O_DSYNC writes)


The main reasons *not* to use Direct IO are:

- Without AIO, Direct IO is unusably slow for most purposes.
- Even with AIO, many parts of postgres need to be modified to perform
  explicit prefetching.
- In situations where shared_buffers cannot be set appropriately
  large, e.g. because many different postgres instances are
  hosted on shared hardware, performance will often be worse than when
  using buffered IO.


## Why Asynchronous IO

- Without AIO we cannot use DIO

- Without asynchronous IO (AIO) PG has to rely on the operating system
  to hide the cost of synchronous IO from Postgres. While this works
  surprisingly well in a lot of workloads, it does not do as good a job
  on prefetching and controlled writeback as we would like.
- There are important expensive operations like fdatasync() where the
  operating system cannot hide the storage latency. This is particularly
  important for WAL writes, where the ability to asynchronously issue
  fdatasync() or O_DSYNC writes can yield significantly higher
  throughput.
- Fetching data into shared buffers asynchronously and concurrently with query
  execution means there is more CPU time for query execution.


## High level difficulties adding AIO/DIO support

- Optionally using AIO leads to convoluted and / or duplicated code.

- Platform dependency: The common AIO APIs are typically specific to one
  platform (linux AIO, linux io_uring, windows IOCP, windows overlapped IO) or
  a few platforms (POSIX AIO, but there are many implementation differences).

- There are a lot of separate places doing IO in PG. Moving all of these to
  efficiently use AIO is an, um, large undertaking.

- Nothing in the buffer management APIs expects there to be more than one IO
  in progress at the same time - which is required to do AIO.


## Portability & Duplication

To avoid the issue of needing non-AIO codepaths to support platforms without
native AIO support, a worker-process based AIO implementation exists (and is
currently the default). It is also convenient for checking whether a problem
is related to the native IO implementation or not.

Thanks to Thomas Munro for helping a *lot* around this area. He wrote
the worker mode, the posix aio mode, added CI, did a lot of other
testing, listened to me...


## Deadlock and Starvation Dangers due to AIO

Using AIO in a naive way can easily lead to deadlocks in an environment where
the source/target of AIO are shared resources, like pages in postgres'
shared_buffers.

Consider one backend performing readahead on a table, initiating IO for a
number of buffers ahead of the current "scan position". If that backend then
performs some operation that blocks, or even just is slow, the IO completion
for the asynchronously initiated read may not be processed.

This AIO implementation solves this problem by requiring that AIO methods
either allow AIO completions to be processed by any backend in the system
(e.g. io_uring, and indirectly posix, via signal handlers), or to guarantee
that AIO processing will happen even when the issuing backend is blocked
(e.g. worker mode, which offloads completion processing to the AIO workers).


## AIO API overview

The main steps to use AIO (without higher level helpers) are:

1) acquire an "unused" AIO: pgaio_io_get()

2) start some IO, this is done by functions like
   pgaio_io_start_(read|write|fsync|flush_range)_(smgr|sb|raw|wal)

   The (read|write|fsync|flush_range) indicates the operation, whereas
   (smgr|sb|raw|wal) determines how IO completions, errors, ... are handled.

   (see below for more details about this design choice - it might or might
   not be right)

3) optionally: assign a backend-local completion callback to the IO
   (pgaio_io_on_completion_local())

4) 2) alone does *not* cause the IO to be submitted to the kernel; instead it
   is put on a per-backend list of pending IOs. The pending IOs can be
   explicitly flushed with pgaio_submit_pending(), but will also be submitted
   if the pending list grows too large, or if the current backend waits for
   the IO.

   There are two main reasons not to submit the IO immediately:
   - If adjacent, we can merge several IOs into one "kernel level" IO during
     submission. Larger IOs are considerably more efficient.
   - Several AIO APIs allow submitting a batch of IOs in one system call.

5) wait for the IO: pgaio_io_wait() waits for an IO "owned" by the current
   backend. When other backends may need to wait for an IO to finish,
   pgaio_io_ref() can put a reference to that AIO in shared memory (e.g. a
   BufferDesc), which can be waited for using pgaio_io_wait_ref().

6) Process the results of the request. If a callback was registered in 3),
   this isn't always necessary. The results of AIO can be accessed using
   pgaio_io_result() which returns an integer where negative numbers are
   -errno, and positive numbers are the [partial] success conditions
   (e.g. potentially indicating a short read).

7) release ownership of the io (pgaio_io_release()) or reuse the IO for
   another operation (pgaio_io_recycle())


Most places that want to use AIO shouldn't themselves need to care about
managing the number of writes in flight, or the readahead distance. To help
with that there are two helper utilities, a "streaming read" and a "streaming
write".

The "streaming read" helper uses a callback to determine which blocks to
prefetch - that allows doing readahead in a sequential fashion, but importantly
also allows asynchronously "reading ahead" non-sequential blocks.

E.g. for vacuum, lazy_scan_heap() has a callback that uses the visibility map
to figure out which block needs to be read next. Similarly lazy_vacuum_heap()
uses the tids in LVDeadTuples to figure out which blocks are going to be
needed. Here's the latter as an example:

https://github.com/anarazel/postgres/commit/a244baa36bfb252d451a017a273a6da1c09f15a3#diff-3198152613d9a28963266427b380e3d4fbbfabe96a221039c6b1f37bc575b965R1906


## IO initialization layering

One difficulty I had in this process was how to initialize IOs in light of the
layering (from bufmgr.c over smgr.c and md.c to fd.c and back, but also
e.g. xlog.c). Sometimes AIO needs to be initialized at the bufmgr.c level,
sometimes at the md.c level, sometimes at the level of fd.c. But to be able to
react to the completion of any such IO, metadata about the operation is needed.

Early on, fd.c initialized IOs, and the context information was just passed
through to fd.c. But that seems quite wrong - fd.c shouldn't have to know
which Buffer an IO is about. But higher levels shouldn't know which
files an operation resides in either, so they can't do all the work
either...

To avoid that, I ended up splitting the "start an AIO" operation into a higher
level part, e.g. pgaio_io_start_read_smgr() - which doesn't know about which
smgr implementation is in use and thus also not what file/offset we're dealing
with, which calls into smgr->md->fd to actually "prepare" the IO (i.e. figure
out file / offset).  This currently looks like:

void
pgaio_io_start_read_smgr(PgAioInProgress *io, struct SMgrRelationData* smgr, ForkNumber forknum,
                         BlockNumber blocknum, char *bufdata)
{
    pgaio_io_prepare(io, PGAIO_OP_READ);

    smgrstartread(io, smgr, forknum, blocknum, bufdata);

    io->scb_data.read_smgr.tag = (AioBufferTag){
        .rnode = smgr->smgr_rnode,
        .forkNum = forknum,
        .blockNum = blocknum
    };

    pgaio_io_stage(io, PGAIO_SCB_READ_SMGR);
}

Once this reaches the fd.c layer the new FileStartRead() function calls
pgaio_io_prep_read() on the IO - but doesn't need to know anything about weird
higher level stuff like relfilenodes.

The _sb (_sb for shared_buffers) variant stores the Buffer, backend and mode
(as in ReadBufferMode).


I'm not sure this is the right design - but it seems a lot better than what I
had earlier...


## Callbacks

In the core AIO pieces there are two different types of callbacks at the
moment:

Shared callbacks, which can be invoked by any backend (normally the issuing
backend / the AIO workers, but possibly other backends if they are waiting for
the IO to complete). For operations on shared resources (e.g. shared buffer
reads/writes, or WAL writes) these shared callbacks need to transition the
state of the object the IO is being done for to completion. E.g. for a shared
buffer read that means setting BM_VALID / unsetting BM_IO_IN_PROGRESS.

The main reason these callbacks exist is that they make it safe for a backend
to issue non-blocking IO on buffers (see the deadlock section above). As any
blocked backend can cause the IO to complete, the deadlock danger is gone.


Local callbacks, one of which the issuer of an IO can associate with the
IO. These can be used to issue further readahead. I initially did not have
these, but I found it hard to keep a controllable number of IOs in
flight without them. They are currently mainly used for error handling
(e.g. erroring out when XLogFileInit() cannot create the file due to ENOSPC),
and to issue more IO (e.g. readahead for heapam).

The local callback system isn't quite right yet.


## AIO conversions

Currently the patch series converts a number of subsystems to AIO. They are of
very varying quality. I mainly did the conversions that I considered either
architecturally interesting, or that caused a fair bit of pain due to slowness
(e.g. VACUUMing without AIO is no fun at all when using DIO). Some also for
fun ;)

Most conversions are fairly simple. E.g. heap scans, checkpointer, bgwriter,
VACUUM are all not too complicated.

There are two conversions that are a good bit more complicated/experimental:

1) Asynchronous, concurrent WAL writes. This is important because right now we
   are very bottlenecked by IO latency, as there effectively only ever is one
   WAL IO in flight at a time. Yet in many situations it is possible to issue
   a WAL write, have one [set of] backends wait for that write as its
   completion satisfies their XLogFlush() needs, while concurrently already
   issuing the next WAL write(s) that other backends need.

   The code here is very crufty, but I think the concept is mostly right.

2) Asynchronous buffer replacement. Even with buffered IO we experience a lot
   of pain when ringbuffers need to write out data (VACUUM!). But with DIO the
   issue gets a lot worse - the kernel can't hide the write latency from us
   anymore.  This change makes each backend asynchronously clean out buffers
   that it will need soon. When a ringbuffer is in use this means cleaning out
   buffers in the ringbuffer; when not, performing the clock sweep and cleaning
   out victim buffers.  Due to 1) the XLogFlush() can also be done
   asynchronously.


There are a *lot* of places that haven't been converted to use AIO.


## Stats

There are two new views: pg_stat_aios, showing AIOs that are currently
in-progress, and pg_stat_aio_backends, showing per-backend statistics about
AIO.


## Code:

https://github.com/anarazel/postgres/tree/aio

I was not planning to attach all the patches on the way to AIO - it's too many
right now... I hope I can reduce the size of the series iteratively into
easier to look at chunks.


## TL;DR: Performance numbers

This is worth an email on its own, and it's pretty late here already and I
want to rerun benchmarks before posting more numbers. So here are just a few
that I could run before falling asleep.


1) 60s of parallel COPY BINARY of an 89MB file into separate tables (s_b = 96GB):

slow NVMe SSD
branch   dio   clients   tps/stddev     checkpoint write time
master   n     8         3.0/2296 ms    4.1s  /  379647 buffers =  723MiB/s
aio      n     8         3.8/1985 ms    11.5s / 1028669 buffers =  698MiB/s
aio      y     8         4.7/204 ms     10.0s / 1164933 buffers =  910MiB/s

raid of 2 fast NVMe SSDs (on pcie3):
branch   dio   clients   tps/stddev     checkpoint write time
master   n     8         9.7/62 ms      7.6s  / 1206376 buffers = 1240MiB/s
aio      n     8         11.4/82 ms     14.3s / 2838129 buffers = 1550MiB/s
aio      y     8         18.1/56 ms     8.9s  / 4486170 buffers = 3938MiB/s


2) pg_prewarm speed

raid of 2 fast NVMe SSDs (on pcie3):

pg_prewarm(62GB, read)
branch   dio   time    bw
master   n     17.4s   3626MiB/s
aio      n     10.3s   6126MiB/s (higher cpu usage)
aio      y     9.8s    6438MiB/s

pg_prewarm(62GB, buffer)
branch   dio   time    bw
master   n     38.3s   1647MiB/s
aio      n     13.6s   4639MiB/s (higher cpu usage)
aio      y     10.7s   5897MiB/s



3) parallel sequential scan speed

parallel sequential scan + count(*) of 59GB table:

branch   dio   max_parallel   time
master   n     0              40.5s
master   n     1              22.6s
master   n     2              16.4s
master   n     4              10.9s
master   n     8              9.3s

aio      y     0              33.1s
aio      y     1              17.2s
aio      y     2              11.8s
aio      y     4              9.0s
aio      y     8              9.2s



On local SSDs there's some, but not a huge performance advantage in most
transactional r/w workloads. But on cloud storage - which has a lot higher
latency - AIO can yield huge advantages. I've seen over 4x.


There are definitely also cases where AIO currently hurts - most of those I
just didn't get around to addressing.

There are a lot more cases in which DIO currently hurts - mostly because the
necessary smarts haven't yet been added.


Comments? Questions?

I plan to send separate emails about smaller chunks of this -
the whole topic is just too big. In particular I plan to send something
around buffer locking / state management - it's one of the core issues
here, imo.


Regards,

Andres



Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Greg Stark
Date:
On Tue, 23 Feb 2021 at 05:04, Andres Freund <andres@anarazel.de> wrote:
>
> ## Callbacks
>
> In the core AIO pieces there are two different types of callbacks at the
> moment:
>
> Shared callbacks, which can be invoked by any backend (normally the issuing
> backend / the AIO workers, but can be other backends if they are waiting for
> the IO to complete). For operations on shared resources (e.g. shared buffer
> reads/writes, or WAL writes) these shared callback needs to transition the
> state of the object the IO is being done for to completion. E.g. for a shared
> buffer read that means setting BM_VALID / unsetting BM_IO_IN_PROGRESS.
>
> The main reason these callbacks exist is that they make it safe for a backend
> to issue non-blocking IO on buffers (see the deadlock section above). As any
> blocked backend can cause the IO to complete, the deadlock danger is gone.

So firstly this is all just awesome work and I have questions but I
don't want them to come across in any way as criticism or as a demand
for more work. This is really great stuff, thank you so much!

The callbacks make me curious about two questions:

1) Is there a chance that a backend issues i/o, the i/o completes in
some other backend and by the time this backend gets around to looking
at the buffer it's already been overwritten again? Do we have to
initiate I/O again or have you found a way to arrange that this
backend has the buffer pinned from the time the i/o starts even though
it doesn't handle the completion?

2) Have you made (or considered making) things like sequential scans
(or more likely bitmap index scans) asynchronous at a higher level.
That is, issue a bunch of asynchronous i/o and then handle the pages
and return the tuples as the pages arrive. Since sequential scans and
bitmap scans don't guarantee to read the pages in order they're
generally free to return tuples from any page in any order. I'm not
sure how much of a win that would actually be since all the same i/o
would be getting executed and the savings in shared buffers would be
small but if there are mostly hot pages you could imagine interleaving
a lot of in-memory pages with the few i/os instead of sitting idle
waiting for the async i/o to return.



> ## Stats
>
> There are two new views: pg_stat_aios showing AIOs that are currently
> in-progress, pg_stat_aio_backends showing per-backend statistics about AIO.

This is impressive. How easy is it to correlate with system aio stats?


-- 
greg



Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Andres Freund
Date:
Hi,

On 2021-02-23 14:58:32 -0500, Greg Stark wrote:
> So firstly this is all just awesome work and I have questions but I
> don't want them to come across in any way as criticism or as a demand
> for more work.

I posted it to get argued with ;).


> The callbacks make me curious about two questions:
>
> 1) Is there a chance that a backend issues i/o, the i/o completes in
> some other backend and by the time this backend gets around to looking
> at the buffer it's already been overwritten again? Do we have to
> initiate I/O again or have you found a way to arrange that this
> backend has the buffer pinned from the time the i/o starts even though
> it doesn't handle the completion?

The initiator of the IO can just keep a pin for the buffer to prevent
that.

There's a lot of complexity around how to handle pinning and locking
around asynchronous buffer IO. I plan to send a separate email with a
more details.

In short: I made it so that shared buffer IO holds a separate
refcount for the duration of the IO - that way the issuer can release
its own pin without causing a problem (consider e.g. an error while an
IO is in flight). The pin held by the IO gets released in the completion
callback.  There's similar trickery with locks - but more about that later.


> 2) Have you made (or considered making) things like sequential scans
> (or more likely bitmap index scans) asynchronous at a higher level.
> That is, issue a bunch of asynchronous i/o and then handle the pages
> and return the tuples as the pages arrive. Since sequential scans and
> bitmap scans don't guarantee to read the pages in order they're
> generally free to return tuples from any page in any order. I'm not
> sure how much of a win that would actually be since all the same i/o
> would be getting executed and the savings in shared buffers would be
> small but if there are mostly hot pages you could imagine interleaving
> a lot of in-memory pages with the few i/os instead of sitting idle
> waiting for the async i/o to return.

I have not. Mostly because it'd basically break the entire regression
test suite. And probably a lot of user expectations (yes,
synchronize_seqscan exists, but it pretty rarely triggers).

I'm not sure how big the win would be - the readahead for heapam.c that
is in the patch set tries to keep ahead of the "current scan position"
by a certain amount - so it'll issue the reads for the "missing" pages
before they're needed. Hopefully early enough to avoid unnecessary
stalls. But I'm pretty sure there'll be cases where that'd require a
prohibitively long "readahead distance".

I think there's a lot of interesting things along these lines that we
could tackle, but since they involve changing results and/or larger
changes to avoid those (e.g. detecting when sequential scan order isn't
visible to the user) I think it'd make sense to separate them from the
aio patchset itself.

If we had the infrastructure to detect whether seqscan order matters, we
could also switch the tuple-iteration order to "backwards" in
heapgetpage() - right now iterating forward in "itemid order" causes a
lot of cache misses, because the hardware prefetcher doesn't predict that
the tuples are laid out in "decreasing pointer order".


> > ## Stats
> >
> > There are two new views: pg_stat_aios showing AIOs that are currently
> > in-progress, pg_stat_aio_backends showing per-backend statistics about AIO.
>
> This is impressive. How easy is it to correlate with system aio stats?

Could you say a bit more about what you are trying to correlate?

Here's some example IOs from pg_stat_aios.

┌─[ RECORD 1 ]─┬───────────────────────────────────────────────────────────────────────────────────────────────────┐
│ backend_type │ client backend                                                                                    │
│ id           │ 98                                                                                                │
│ gen          │ 13736                                                                                             │
│ op           │ write                                                                                             │
│ scb          │ sb                                                                                                │
│ flags        │ PGAIOIP_IN_PROGRESS | PGAIOIP_INFLIGHT                                                            │
│ ring         │ 5                                                                                                 │
│ owner_pid    │ 1866051                                                                                           │
│ merge_with   │ (null)                                                                                            │
│ result       │ 0                                                                                                 │
│ desc         │ fd: 38, offset: 329588736, nbytes: 8192, already_done: 0, release_lock: 1, buffid: 238576         │

├─[ RECORD 24 ]┼───────────────────────────────────────────────────────────────────────────────────────────────┤
│ backend_type │ checkpointer                                                                                  │
│ id           │ 1501                                                                                          │
│ gen          │ 15029                                                                                         │
│ op           │ write                                                                                         │
│ scb          │ sb                                                                                            │
│ flags        │ PGAIOIP_IN_PROGRESS | PGAIOIP_INFLIGHT                                                        │
│ ring         │ 3                                                                                             │
│ owner_pid    │ 1865275                                                                                       │
│ merge_with   │ 1288                                                                                          │
│ result       │ 0                                                                                             │
│ desc         │ fd: 24, offset: 105136128, nbytes: 8192, already_done: 0, release_lock: 1, buffid: 202705     │

├─[ RECORD 31 ]┼───────────────────────────────────────────────────────────────────────────────────────────────────┤
│ backend_type │ walwriter                                                                                         │
│ id           │ 90                                                                                                │
│ gen          │ 26498                                                                                             │
│ op           │ write                                                                                             │
│ scb          │ wal                                                                                               │
│ flags        │ PGAIOIP_IN_PROGRESS | PGAIOIP_INFLIGHT                                                            │
│ ring         │ 5                                                                                                 │
│ owner_pid    │ 1865281                                                                                           │
│ merge_with   │ 181                                                                                               │
│ result       │ 0                                                                                                 │
│ desc         │ write_no: 17, fd: 12, offset: 6307840, nbytes: 1048576, already_done: 0, bufdata: 0x7f94dd670000  │
└──────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────┘


And the per-backend AIO view shows information like this (too wide to
display all cols):

SELECT
    pid, last_context, backend_type,
    executed_total_count,issued_total_count,submissions_total_count, inflight_count
FROM pg_stat_aio_backends sab JOIN pg_stat_activity USING (pid);

┌─────────┬──────────────┬──────────────────────────────┬──────────────────────┬────────────────────┬─────────────────────────┬────────────────┐
│   pid   │ last_context │         backend_type         │ executed_total_count │ issued_total_count │ submissions_total_count │ inflight_count │
├─────────┼──────────────┼──────────────────────────────┼──────────────────────┼────────────────────┼─────────────────────────┼────────────────┤
│ 1865291 │            3 │ logical replication launcher │                    0 │                  0 │                       0 │              0 │
│ 1865296 │            0 │ client backend               │                   85 │                 85 │                      85 │              0 │
│ 1865341 │            7 │ client backend               │              9574416 │            2905321 │                  345642 │              0 │
...
│ 1873501 │            3 │ client backend               │                 3565 │               3565 │                     467 │             10 │
...
│ 1865277 │            0 │ background writer            │               695402 │             575906 │                   13513 │              0 │
│ 1865275 │            3 │ checkpointer                 │              4664110 │            3488530 │                 1399896 │              0 │
│ 1865281 │            3 │ walwriter                    │                77203 │               7759 │                    7747 │              3 │
└─────────┴──────────────┴──────────────────────────────┴──────────────────────┴────────────────────┴─────────────────────────┴────────────────┘

It's not super obvious at this point but executed_total_count /
issued_total_count shows the rate at which IOs have been
merged. issued_total_count / submissions_total_count shows how many
(already merged) IOs were submitted together in one "submission batch"
(for io_method=io_uring, io_uring_enter()).


So pg_stat_aios should allow enriching system stats - e.g. by making it
possible to split out WAL writes from data file writes. And - except that
the current callbacks for that aren't great - it should even allow splitting
the IO by different relfilenodes etc.


I assume pg_stat_aio_backends can also be helpful, e.g. by showing which
backends currently have the deep IO queues that cause latency delays in other
backends, and which backends do a lot of sequential IO (high
executed_total_count / issued_total_count) and which do only random IO...

Greetings,

Andres Freund



Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Thomas Munro
Date:
On Tue, Feb 23, 2021 at 11:03 PM Andres Freund <andres@anarazel.de> wrote:
> over the last ~year I spent a lot of time trying to figure out how we could
> add AIO (asynchronous IO) and DIO (direct IO) support to postgres. While
> there's still a *lot* of open questions, I think I now have a decent handle on
> most of the bigger architectural questions.  Thus this long email.

Hello,

Very cool to see this project escaping onto -hackers!

I have done some work on a couple of low level parts of it, and I
wanted to show a quick "hey, where'd my system calls go?" demo, which
might help illustrate some very simple things about this stuff.  Even
though io_uring is the new hotness in systems programming, I'm going
to use io_mode=worker here.  It's the default in the current patch
set, it works on all our supported OSes and is easier to understand
without knowledge of shiny new or obscure old AIO system interfaces.
I'll also use io_workers=1, an artificially low setting to make it
easy to spy on (pseudo) async I/O with strace/truss/dtruss on a single
process, and max_parallel_workers_per_gather=0 to keep executor
parallelism from confusing matters.

The first thing to notice is that there's an "io worker" process, and
while filling up a big table with "insert into t select
generate_series(1, 100000000)", it's doing a constant stream of 128KB
pwritev() calls.  These are writing out 16 blocks from shared buffers
at a time:

  pwritev(44, [{iov_base=..., iov_len=73728},
               {iov_base=..., iov_len=24576},
               {iov_base=..., iov_len=32768}], 3, 228032512) = 131072

The reason there are 3 vectors there rather than 16 is just that some
of the buffers happened to be adjacent in memory and we might as well
use the smallest number of vectors.  Just after we've started up and
the buffer pool is empty, it's easy to find big single vector I/Os,
but things soon get more fragmented (blocks adjacent on disk become
less likely to be adjacent in shared buffers) and that number goes up,
but that shouldn't make much difference to the OS or hardware assuming
decent scatter/gather support through the stack.  If io_data_direct=on
(not the default) and the blocks are in one physical extent on the
file system, that might even go all the way down to the disk as a
single multi-segment write command for the storage hardware DMA engine
to beam directly in/out of our buffer pool without CPU involvement.

Mixed into that stream of I/O worker system calls, you'll also see WAL
going out to disk:

  pwritev(15, [{iov_base=..., iov_len=1048576}], 1, 4194304) = 1048576

Meanwhile, the user session process running the big INSERT can be seen
signalling the I/O worker to wake it up.  The same thing happens for
bgwriter, checkpointer, autovacuum and walwriter: you can see them all
handing off most of their I/O work to the pool of I/O workers, with a
bit of new signalling going on (which we try to minimise, and can
probably minimise much more).  (You might be able to see some evidence
of Andres's new buffer cleaning scheme too, which avoids some bad
patterns of interleaving small reads and writes, but I'm skipping
right over here...)

Working through a very simple example of how the I/O comes to be
consolidated and parallelised, let's look at a simple non-parallel
SELECT COUNT(*) query on a large table.  The I/O worker does a stream
of scattered reads into our buffer pool:

  preadv(51, [{iov_base=..., iov_len=24576},
              {iov_base=..., iov_len=8192},
              {iov_base=..., iov_len=16384},
              {iov_base=..., iov_len=16384},
              {iov_base=..., iov_len=16384},
              {iov_base=..., iov_len=49152}], 6, 190808064) = 131072

Meanwhile our user session backend can be seen waking it up whenever
it's trying to start I/O and finds it snoozing:

  kill(1803835, SIGURG)                   = 0
  kill(1803835, SIGURG)                   = 0
  kill(1803835, SIGURG)                   = 0
  kill(1803835, SIGURG)                   = 0
  kill(1803835, SIGURG)                   = 0

Notice that there are no sleeping system calls in the query backend,
meaning the I/O in this example is always finished by the time the
executor gets around to accessing the page it requested, so we're
staying far enough ahead and we can be 100% CPU bound.  In unpatched
PostgreSQL we'd hope to have no actual sleeping in such a simple case
anyway, thanks to the OS's readahead heuristics; but (1) we'd still do
individual pread(8KB) calls, meaning that the user's query is at least
having to pay the CPU cost of a return trip into the kernel and a
copyout of 8KB from kernel space to user space, here avoided, (2) in
io_data_direct=on mode, there's no page cache and thus no kernel read
ahead, so we need to replace that mechanism with something anyway, (3)
it's needed for non-sequential access like btree scans.

Sometimes I/Os are still run in user backends, for example because (1)
existing non-AIO code paths are still reached, (2) in worker mode,
some kinds of I/Os can't be handed off to another process due to lack
of a way to open some fds or because we're in single process mode, (3)
because a heuristic kicks in when we know there's only one I/O to run
and we know we'll immediately wait for it and we can skip a lot of
communication with a traditional synchronous syscall (worker mode only
for now; needs to be done for the others).

In order to be able to generate a stream of big vectored reads/writes,
and start them far enough ahead of time that they're finished before
we need the data, there are several layers of new infrastructure that
Andres already mentioned and can explain far better than I, but super
briefly:

heapam.c uses a "pg_streaming_read" object (aio_util.c) to get buffers
to scan, instead of directly calling ReadBuffer().  It gives the
pg_streaming_read a callback of its own, so that heapam.c remains in
control of what is read, but the pg_streaming_read is in control of
readahead distance and also "submission".  heapam.c's callback calls
ReadBufferAsync() to initiate reads of pages that it will need soon,
which it does with pgaio_io_start_read_sb() if there's a cache miss.
This results in 8KB reads queued up in the process's pending I/O list,
with pgaio_read_sb_complete as the completion function to run when
each read has eventually completed.  When the pending list grows to a
certain length, it is submitted by pg_streaming_read code.  That
involves first "combining" pending I/Os: this is where read/write of
adjacent ranges of files are merged into larger I/Os up to a limit.
Then the I/Os are submitted to the OS, and we'll eventually learn
about their completion, via io_method-specific means (see
aio_worker.c, aio_uring.c, aio_posix.c and one day probably also
aio_win32.c).  At that point, merged I/Os will be uncombined.
Skipping over some complication about retrying on some kinds of
failure/partial I/O, that leads to ReadBufferCompleteWrite() being
called for each buffer.  (Far be it from me to try to explain the
rather complex interlocking required to deal with pins and locks
between ReadBufferAsync() and ReadBufferCompleteWrite() in
(potentially) another process while the I/O is in progress, at this
stage.)
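
To make the batching/combining shape concrete, here's a toy sketch of the callback-driven pattern (hypothetical names, and a deliberately simplified "submit" that just counts combined I/Os - the real pg_streaming_read in aio_util.c is far more involved):

```c
#include <assert.h>

#define PENDING_LIMIT 16

/* Callback owned by the scan code: returns next block to read, or -1. */
typedef int (*next_block_cb) (void *private_data);

typedef struct StreamingRead
{
	next_block_cb next;
	void	   *private_data;
	int			pending[PENDING_LIMIT];	/* per-backend pending I/O list */
	int			npending;
	int			submitted_ios;	/* "kernel level" I/Os issued so far */
} StreamingRead;

/* Count one combined I/O per run of consecutive blocks, then clear. */
static void
submit_pending(StreamingRead *sr)
{
	for (int i = 0; i < sr->npending; i++)
		if (i == 0 || sr->pending[i] != sr->pending[i - 1] + 1)
			sr->submitted_ios++;
	sr->npending = 0;
}

/* The helper stays in control of readahead distance and submission. */
static void
streaming_read_run(StreamingRead *sr)
{
	int			blk;

	while ((blk = sr->next(sr->private_data)) >= 0)
	{
		sr->pending[sr->npending++] = blk;
		if (sr->npending == PENDING_LIMIT)
			submit_pending(sr);	/* list grew long enough: one syscall */
	}
	submit_pending(sr);			/* flush whatever remains */
}

/* Example callback: a "scan" that wants blocks 0..31 in order. */
static int	seq_pos = 0;

static int
seq_next_block(void *private_data)
{
	(void) private_data;
	return seq_pos < 32 ? seq_pos++ : -1;
}
```

The point of the shape: the scan-side callback decides *what* to read, while the helper decides *when* to submit, which is what lets adjacent reads be combined into large vectored I/Os.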

Places in the tree that want to do carefully controlled I/O depth
management can consider using pg_streaming_{read,write}, providing
their own callback to do the real work (though it's not necessary, and
not all AIO uses suit the "streaming" model).  There's also the
traditional PrefetchBuffer() mechanism, which can still be used to
initiate buffer reads as before.  It's comparatively primitive; since
you don't know when the I/O completes, you have to use conservative
models as I do in my proposed WAL prefetching patch.  That patch (like
probably many others like CF #2799) works just fine on top of the AIO
branch, with some small tweaks: it happily shifts all I/O system calls
out of the recovery process, so that instead of calling
posix_fadvise() and then a bit later pread() for each cold page
accessed, it makes one submission system call for every N cold pages
(or, in some cases, no system calls at all).  A future better
integration would probably use pg_streaming_read for precise control
of the I/O depth instead of the little LSN queue it currently uses,
but I haven't tried to write that yet.

If you do simple large INSERTs and SELECTs with one of the native
io_method settings instead of worker mode, it'd be much the same, in
terms of most of the architecture.  The information in the pg_stat_XXX
views is almost exactly the same.  There are two major differences:
(1) the other methods have no I/O worker processes, because the kernel
manages the I/O (or in some unfortunate cases runtime libraries fake
it with threads), (2) the "shared completion callbacks" (see
aio_scb.c) are run by I/O workers in worker mode, but are run by
whichever process "drains" the I/O in the other modes.  That is,
initiating processes never hear about I/Os completing from the
operating system, they just eventually wait on them and find that
they're already completed (though they do run the "local callback" if
there is one, which is for example the point at which
pg_streaming_read might initiate more I/O), or alternatively see that
they aren't, and wait on a condition variable for an I/O worker to
signal completion.  So far this seems like a good choice...
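
The split between "shared" and "local" callbacks might be caricatured like this (hypothetical types; the real interlocking, condition variables and cross-process draining are elided):

```c
#include <assert.h>

typedef struct AioSketch
{
	int			completed;		/* set by whichever process drains the I/O */
	void		(*shared_cb) (struct AioSketch *);	/* e.g. finish BufferDesc */
	void		(*local_cb) (struct AioSketch *);	/* e.g. more readahead */
} AioSketch;

/*
 * Run by an I/O worker (worker mode) or by whichever backend happens to
 * drain completions from the kernel (other modes): the shared callback
 * runs here, possibly in a different process than the issuer.
 */
static void
aio_drain(AioSketch *io)
{
	if (!io->completed)
	{
		io->shared_cb(io);
		io->completed = 1;
	}
}

/*
 * Run by the issuing backend when it finally needs the result: it never
 * hears about the completion from the OS, it just finds the I/O already
 * drained (or waits; here caricatured as draining it itself), then runs
 * its local callback.
 */
static void
aio_wait(AioSketch *io)
{
	while (!io->completed)
		aio_drain(io);			/* worker mode would sleep on a condvar */
	if (io->local_cb)
		io->local_cb(io);
}

static int	shared_runs = 0;
static int	local_runs = 0;

static void
count_shared(AioSketch *io)
{
	(void) io;
	shared_runs++;
}

static void
count_local(AioSketch *io)
{
	(void) io;
	local_runs++;
}
```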

Hope that helped show off a couple of features of this scheme.



Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Alexey Lesovsky
Date:
Hi,

Thank you for the amazing and great work.

On 23.02.2021 15:03, Andres Freund wrote:
> ## Stats
>
> There are two new views: pg_stat_aios showing AIOs that are currently
> in-progress, pg_stat_aio_backends showing per-backend statistics about AIO.

As a DBA I would like to propose a few amendments that might help with
practical usage of the stats when the feature is finally implemented.
My suggestions aren't related to the central idea of the proposed
changes, but rather to the stats part.

A quick side note, there are two terms in Prometheus 
(https://prometheus.io/docs/concepts/metric_types/):
1. Counter. A counter is a cumulative metric that represents a single 
monotonically increasing counter whose value can only increase or be 
reset to zero on restart.
2. Gauge. A gauge is a metric that represents a single numerical value 
that can arbitrarily go up and down.

For the purposes of long-term stats collection, COUNTERs are preferred
over GAUGEs, because COUNTERs let us see how metrics change over time
without missing potential spikes in activity. As a result, we get a
much better historical perspective.

Measuring and collecting GAUGEs is limited to the moments in time when
the stats are taken (snapshots), so the changes that take place between
snapshots remain unmeasured. In systems with a high rate of
transactions per second, measuring GAUGEs won't provide the full
picture even with a 1 second interval between snapshots. In addition,
most monitoring systems like Prometheus, Zabbix, etc. use longer
intervals (from 10-15 to 60 seconds).
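
The practical difference can be shown with a trivial rate calculation: two COUNTER snapshots recover the average rate over the whole interval, even across a reset to zero, which no pair of GAUGE snapshots can do. A minimal sketch, not tied to any particular monitoring system:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Average rate between two snapshots of a monotonically increasing
 * counter.  On a reset (restart), the best estimate is that the counter
 * restarted from zero within the interval.
 */
static double
counter_rate(uint64_t prev, uint64_t cur, double interval_secs)
{
	uint64_t	delta = (cur >= prev) ? cur - prev : cur;

	return (double) delta / interval_secs;
}
```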

The main idea is to try to expose almost all numeric stats as COUNTERs -
this increases the overall observability of the implemented feature.

pg_stat_aios.
In general, this stat is a set of text values, and at the same time it 
looks GAUGE-like (similar to pg_stat_activity or pg_locks), and is only 
relevant for the moment when the user is looking at it. I think it would 
be better to rename this view to pg_stat_progress_aios. And keep 
pg_stat_aios for other AIO stats with global COUNTERs (like stuff in 
pg_stat_user_tables or pg_stat_statements, or system-wide /proc/stat, 
/proc/diskstats).

pg_stat_aio_backends.
This stat is based on COUNTERs, which is great, but the issue here is 
that its lifespan is limited by the lifespan of the backend processes - 
once the backend exits the stat will no longer be available - which 
could be inappropriate in workloads with short-lived backends.

I think there might be a few existing examples in the current code that
could be repurposed to implement the suggestions above (such as
pg_stat_user_tables, pg_stat_database, etc). With this in mind, I think
incorporating these changes shouldn't take significant effort
considering the benefit they will bring to the final user.

Once again, huge respect for your work on these changes, and good luck.

Regards, Alexey




Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Greg Stark
Date:
I guess what I would be looking for in stats would be a way to tell
what the bandwidth, latency, and queue depth is. Ideally one day
broken down by relation/index and pg_stat_statement record.

I think seeing the actual in-flight async requests in a connection is
probably not going to be very useful in production. It's very low
level, and in production the user is just going to find that detail
overwhelming. It is kind of cool to see the progress of sequential
operations, but I think that should be solved in a higher-level way
than this anyway.

What we need to calculate these values would be the kinds of per-op
stats nfsiostat uses from /proc/self/mountstats:
https://utcc.utoronto.ca/~cks/space/blog/linux/NFSMountstatsNFSOps

So number of async reads we've initiated, how many callbacks have been
called, total cumulative elapsed time between i/o issued and i/o
completed, total bytes of i/o initiated, total bytes of i/o completed.
As well as a counter of requests which returned errors (EOF? I/O
error?). And if there are other locks or queues internal to postgres,
total time spent in those states.

I have some vague idea that we should have a generic infrastructure
for stats that automatically counts things associated with plan nodes
and automatically bubbles that data up to the per-transaction,
per-backend, per-relation, and pg_stat_statements stats. But that's a
whole other ball of wax :)



Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Dmitry Dolgov
Date:
> On Tue, Feb 23, 2021 at 02:03:44AM -0800, Andres Freund wrote:
>
> over the last ~year I spent a lot of time trying to figure out how we could
> add AIO (asynchronous IO) and DIO (direct IO) support to postgres. While
> there's still a *lot* of open questions, I think I now have a decent handle on
> most of the bigger architectural questions.  Thus this long email.
>
> Just to be clear: I don't expect the current to design to survive as-is. If
> there's a few sentences below that sound a bit like describing the new world,
> that's because they're from the README.md in the patch series...

Thanks!

> Comments? Questions?
>
> I plan to send separate emails about smaller chunks of this seperately -
> the whole topic is just too big. In particular I plan to send something
> around buffer locking / state management - it's a one of the core issues
> around this imo.

I'm curious about the control knobs for this feature; it's somewhat
related to the stats questions also discussed in this thread. I guess
the most important of those are max_aio_in_flight, io_max_concurrency
etc., and they're going to be hard limits, right? I'm curious whether
it makes sense to explore some sort of "backpressure", e.g. if the
number of in-flight requests is too large, calculate inflight_limit a
bit lower than possible (to avoid hard performance deterioration when
the db is trying to do too much IO, and instead degrade smoothly).
From what I remember io_uring has something similar, but only for
SQPOLL. Another, similar question: could this be used for throttling
of some overloaded workers in case of misconfigured clients or such?



Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Andres Freund
Date:
Hi,

On 2021-02-24 21:15:14 +0500, Alexey Lesovsky wrote:
> pg_stat_aios.
> In general, this stat is a set of text values, and at the same time it looks
> GAUGE-like (similar to pg_stat_activity or pg_locks), and is only relevant
> for the moment when the user is looking at it. I think it would be better to
> rename this view to pg_stat_progress_aios. And keep pg_stat_aios for other
> AIO stats with global COUNTERs (like stuff in pg_stat_user_tables or
> pg_stat_statements, or system-wide /proc/stat, /proc/diskstats).

Right - arguably it really shouldn't even have _stat_ in the name... I
don't particularly like the idea of adding _progress_ as that seems it'd
lead to confusing it with pg_stat_progress_vacuum etc - and it's quite a
different beast.


> pg_stat_aio_backends.
> This stat is based on COUNTERs, which is great, but the issue here is that
> its lifespan is limited by the lifespan of the backend processes - once the
> backend exits the stat will no longer be available - which could be
> inappropriate in workloads with short-lived backends.

There's a todo somewhere to roll over the per-connection stats into a
global stats piece on disconnect. In addition I was thinking of adding a
view that sums up the value of "already disconnected backends" and the
currently connected ones.  Would that mostly address your concerns?


> I think there might be few existing examples in the current code that could
> be repurposed to implement the suggestions above (such as
> pg_stat_user_tables, pg_stat_database, etc). With this in mind, I think
> having these changes incorporated shouldn’t take significant effort
> considering the benefit it will bring to the final user.

Yea - I kind of was planning to go somewhere roughly in the direction
you suggest, but took a few shortcuts due to the size of the
project. Having the views made it a lot easier to debug / develop, but
supporting longer lived stuff wasn't yet crucial. But I agree, we really
should have it...

Greetings,

Andres Freund



Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Andres Freund
Date:
Hi,

On 2021-02-24 14:59:19 -0500, Greg Stark wrote:
> I guess what I would be looking for in stats would be a way to tell
> what the bandwidth, latency, and queue depth is. Ideally one day
> broken down by relation/index and pg_stat_statement record.

I think doing it at that granularity will likely be too expensive...


> I think seeing the actual in flight async requests in a connection is
> probably not going to be very useful in production.

I think it's good for analyzing concrete performance issues, but
probably not that much more. Although, it's not too hard to build
sampling based on top of it with a tiny bit of work (should display the
relfilenode etc).


> So number of async reads we've initiated, how many callbacks have been
> called, total cumulative elapsed time between i/o issued and i/o
> completed, total bytes of i/o initiated, total bytes of i/o completed.

Much of that is already in pg_stat_aio_backends - but is lost after
disconnect (easy to solve). We don't track bytes of IO currently, but
that'd not be hard.

However, it's surprisingly hard to do the measurement between "issued"
and "completed" in a meaningful way. It's obviously not hard to measure
the time at which the request was issued, but there's no real way to
determine the time at which it was completed. If a backend is busy doing
other things (e.g. invoke aggregate transition functions), we'll not see
the completion immediately, and therefore not have an accurate
timestamp.

With several methods of doing AIO we can set up signals that fire on
completion, but that's pretty darn expensive. And it's painful to write
such signal handlers in a safe way.


> I have some vague idea that we should have a generic infrastructure
> for stats that automatically counts things associated with plan nodes
> and automatically bubbles that data up to the per-transaction,
> per-backend, per-relation, and pg_stat_statements stats. But that's a
> whole other ball of wax :)

Heh, yea, let's tackle that separately ;)

Greetings,

Andres Freund



Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Andres Freund
Date:
Hi,

On 2021-02-24 21:41:16 +0100, Dmitry Dolgov wrote:
> I'm curious about control knobs for this feature, it's somewhat related
> to the stats questions also discussed in this thread. I guess most
> important of those are max_aio_in_flight, io_max_concurrency etc, and
> they're going to be a hard limits, right?

Yea - there's a lot more work needed in that area.

io_max_concurrency especially really should be a GUC, I was just too
lazy for that so far.


> I'm curious if it makes sense
> to explore possibility to have these sort of "backpressure", e.g. if
> number of inflight requests is too large calculate inflight_limit a bit
> lower than possible (to avoid hard performance deterioration when the db
> is trying to do too much IO, and rather do it smooth).

It's decidedly nontrivial to compute "too large" - and pretty workload
dependent (e.g. lower QDs are better for latency-sensitive OLTP, higher
QDs are better for bulk r/w heavy analytics). So I don't really want to
go there for now - the project is already very large.

What I do think is needed and feasible (there's a bunch of TODOs in the
code about it already) is to be better at only utilizing deeper queues
when lower queues don't suffice. So we e.g. don't read ahead more than a
few blocks for a scan where the query is spending most of the time
"elsewhere".

There's definitely also some need for a bit better global, instead of
per-backend, control over the number of IOs in flight. That's not too
hard to implement - the hardest probably is to avoid it becoming a
scalability issue.

I think the area with the most need for improvement is figuring out how
we determine the queue depths for different things using IO. Don't
really want to end up with 30 parameters influencing what queue depth to
use for (vacuum, index builds, sequential scans, index scans, bitmap
heap scans, ...) - but how much they benefit from a deeper queue will
differ between places.


> From what I remember io_uring does have something similar only for
> SQPOLL. Another similar question if this could be used for throttling
> of some overloaded workers in case of misconfigured clients or such?

You mean dynamically? Or just by setting the concurrency lower for
certain users? I think doing so dynamically is way too complicated for
now. But I'd expect configuring it on a per-user basis or such to be a
reasonable thing. That might require splitting it into two GUCs - one
SUSET one and a second one that's settable by any user, but can only
lower the depth.
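
That two-GUC scheme would boil down to something like this (hypothetical names, just to illustrate the idea of a SUSET ceiling plus a user knob that can only lower the effective depth):

```c
#include <assert.h>

/* Hypothetical GUCs: a superuser-settable ceiling plus a user knob. */
static int	io_depth_limit_suset = 128;	/* PGC_SUSET */
static int	io_depth_limit_user = 128;	/* settable by any user */

/* The depth actually used: the user value can only lower the ceiling. */
static int
effective_io_depth(void)
{
	return io_depth_limit_user < io_depth_limit_suset ?
		io_depth_limit_user : io_depth_limit_suset;
}
```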

I think it'll be pretty useful to e.g. configure autovacuum to have a
low queue depth instead of using the current cost limiting. That way the
impact on the overall system is limited, but it's not slowed down
unnecessarily as much.

Greetings,

Andres Freund



Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Dmitry Dolgov
Date:
> On Wed, Feb 24, 2021 at 01:45:10PM -0800, Andres Freund wrote:
>
> > I'm curious if it makes sense
> > to explore possibility to have these sort of "backpressure", e.g. if
> > number of inflight requests is too large calculate inflight_limit a bit
> > lower than possible (to avoid hard performance deterioration when the db
> > is trying to do too much IO, and rather do it smooth).
>
> What I do think is needed and feasible (there's a bunch of TODOs in the
> code about it already) is to be better at only utilizing deeper queues
> when lower queues don't suffice. So we e.g. don't read ahead more than a
> few blocks for a scan where the query is spending most of the time
> "elsewhere.
>
> There's definitely also some need for a bit better global, instead of
> per-backend, control over the number of IOs in flight. That's not too
> hard to implement - the hardest probably is to avoid it becoming a
> scalability issue.
>
> I think the area with the most need for improvement is figuring out how
> we determine the queue depths for different things using IO. Don't
> really want to end up with 30 parameters influencing what queue depth to
> use for (vacuum, index builds, sequential scans, index scans, bitmap
> heap scans, ...) - but they benefit from a deeper queue will differ
> between places.

Yeah, sounds like an interesting opportunity for improvements. I'm
preparing a few benchmarks to understand better how this all works, so
I'll keep this in mind.

> > From what I remember io_uring does have something similar only for
> > SQPOLL. Another similar question if this could be used for throttling
> > of some overloaded workers in case of misconfigured clients or such?
>
> You mean dynamically? Or just by setting the concurrency lower for
> certain users? I think doing so dynamically is way too complicated for
> now. But I'd expect configuring it on a per-user basis or such to be a
> reasonable thing. That might require splitting it into two GUCs - one
> SUSET one and a second one that's settable by any user, but can only
> lower the depth.
>
> I think it'll be pretty useful to e.g. configure autovacuum to have a
> low queue depth instead of using the current cost limiting. That way the
> impact on the overall system is limitted, but it's not slowed down
> unnecessarily as much.

Yes, you got it right, not dynamically, but rather expose this to be
configured on e.g. per-user basis.



Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Alexey Lesovsky
Date:
Hi,

On 25.02.2021 02:03, Andres Freund wrote:
>> pg_stat_aio_backends.
>> This stat is based on COUNTERs, which is great, but the issue here is that
>> its lifespan is limited by the lifespan of the backend processes - once the
>> backend exits the stat will no longer be available - which could be
>> inappropriate in workloads with short-lived backends.
> There's a todo somewhere to roll over the per-connection stats into a
> global stats piece on disconnect. In addition I was thinking of adding a
> view that sums up the value of "already disconnected backends" and the
> currently connected ones.  Would that mostly address your concerns?

Yes, the approach with separate stats for live and disconnected
backends looks good and solves the problem with "stats loss".

Or it could be done like the stats for shared objects in
pg_stat_database, where a special NULL database is used.

Regards, Alexey



Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Dmitry Dolgov
Date:
Sorry for another late reply; I finally found some time to formulate a
couple of thoughts.

> On Thu, Feb 25, 2021 at 09:22:43AM +0100, Dmitry Dolgov wrote:
> > On Wed, Feb 24, 2021 at 01:45:10PM -0800, Andres Freund wrote:
> >
> > > I'm curious if it makes sense
> > > to explore possibility to have these sort of "backpressure", e.g. if
> > > number of inflight requests is too large calculate inflight_limit a bit
> > > lower than possible (to avoid hard performance deterioration when the db
> > > is trying to do too much IO, and rather do it smooth).
> >
> > What I do think is needed and feasible (there's a bunch of TODOs in the
> > code about it already) is to be better at only utilizing deeper queues
> > when lower queues don't suffice. So we e.g. don't read ahead more than a
> > few blocks for a scan where the query is spending most of the time
> > "elsewhere.
> >
> > There's definitely also some need for a bit better global, instead of
> > per-backend, control over the number of IOs in flight. That's not too
> > hard to implement - the hardest probably is to avoid it becoming a
> > scalability issue.
> >
> > I think the area with the most need for improvement is figuring out how
> > we determine the queue depths for different things using IO. Don't
> > really want to end up with 30 parameters influencing what queue depth to
> > use for (vacuum, index builds, sequential scans, index scans, bitmap
> > heap scans, ...) - but they benefit from a deeper queue will differ
> > between places.

Talking about parameters, from what I understand the actual number of
queues (e.g. io_urings) created is specified by PGAIO_NUM_CONTEXTS -
shouldn't it be configurable? Maybe in fact there shouldn't be that
many knobs after all - if the model assumes the storage has:

* Some number of hardware queues; then the number of queues the AIO
  implementation needs to use depends on it. For example, lowering the
  number of contexts between different benchmark runs, I could see that
  some of the hardware queues were significantly underutilized.
  Potentially there could also be such a thing as too many contexts.

* A certain bandwidth; then the submit batch size (io_max_concurrency
  or PGAIO_SUBMIT_BATCH_SIZE) depends on it. This would allow
  distinguishing attached storage with high bandwidth and high latency
  from local storage.

From what I see, max_aio_in_flight is used as the queue depth for
contexts, which is workload dependent and not easy to figure out, as
you mentioned. To avoid having 30 different parameters, maybe it's more
feasible to introduce "shallow" and "deep" queues, where the particular
depth for those could be derived from the depth of the hardware queues.
The question of which activity should use which queue is not easy, but
if I get it right from queuing theory (assuming IO producers are
stationary processes and fixed IO latency from the storage), it depends
on the IO arrival distribution in every particular case, and this in
turn could be roughly estimated for each type of activity. One can
expect different IO arrival distributions for e.g. a normal point-query
backend and a checkpoint or vacuum process, no matter what the other
conditions are (collecting those for a few benchmark runs indeed gives
pretty distinct distributions).

If I understand correctly, those contexts defined by PGAIO_NUM_CONTEXTS
are the main workhorse, right? I'm asking because there is also
something called local_ring, but it seems no IOs are submitted into
those. Assuming that contexts are the main way of submitting IO, it
would also be interesting to explore contexts isolated for different
purposes. I haven't finished my changes here enough to give any
results, but at least doing some tests with fio shows different
latencies when two io_urings are processing mixed reads/writes vs
isolated reads or writes. On a side note, at the end of the day there
are so many queues - application queue, io_uring, mq software queue,
hardware queue - I'm really curious whether it would amplify tail
latencies.

Another thing I've noticed is that the AIO implementation is much more
significantly affected by side IO activity than the synchronous one.
E.g. the AIO version's tps drops from tens of thousands to a couple of
hundred just because some kworker started to flush dirty buffers
(especially with disabled writeback throttling), while the synchronous
version doesn't suffer that much. Not sure what to make of it. Btw,
overall I've managed to get better numbers from the AIO implementation
on IO-bound test cases with a local NVMe device, but non-IO-bound ones
were mostly a bit slower - is that expected, or am I missing something?

An interesting thing to note is that the io_uring implementation
apparently relaxed the requirements for polling operations; now one
needs only the CAP_SYS_NICE capability, not CAP_SYS_ADMIN. I guess
theoretically there are no issues with using it within the current
design?



Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Melanie Plageman
Date:
On Tue, Feb 23, 2021 at 5:04 AM Andres Freund <andres@anarazel.de> wrote:
>
> ## AIO API overview
>
> The main steps to use AIO (without higher level helpers) are:
>
> 1) acquire an "unused" AIO: pgaio_io_get()
>
> 2) start some IO, this is done by functions like
>    pgaio_io_start_(read|write|fsync|flush_range)_(smgr|sb|raw|wal)
>
>    The (read|write|fsync|flush_range) indicates the operation, whereas
>    (smgr|sb|raw|wal) determines how IO completions, errors, ... are handled.
>
>    (see below for more details about this design choice - it might or not be
>    right)
>
> 3) optionally: assign a backend-local completion callback to the IO
>    (pgaio_io_on_completion_local())
>
> 4) 2) alone does *not* cause the IO to be submitted to the kernel, but to be
>    put on a per-backend list of pending IOs. The pending IOs can be explicitly
>    be flushed pgaio_submit_pending(), but will also be submitted if the
>    pending list gets to be too large, or if the current backend waits for the
>    IO.
>
>    The are two main reasons not to submit the IO immediately:
>    - If adjacent, we can merge several IOs into one "kernel level" IO during
>      submission. Larger IOs are considerably more efficient.
>    - Several AIO APIs allow to submit a batch of IOs in one system call.
>
> 5) wait for the IO: pgaio_io_wait() waits for an IO "owned" by the current
>    backend. When other backends may need to wait for an IO to finish,
>    pgaio_io_ref() can put a reference to that AIO in shared memory (e.g. a
>    BufferDesc), which can be waited for using pgaio_io_wait_ref().
>
> 6) Process the results of the request. If a callback was registered in 3),
>    this isn't always necessary. The results of AIO can be accessed using
>    pgaio_io_result() which returns an integer where negative numbers are
>    -errno, and positive numbers are the [partial] success conditions
>    (e.g. potentially indicating a short read).
>
> 7) release ownership of the io (pgaio_io_release()) or reuse the IO for
>    another operation (pgaio_io_recycle())
>
>
> Most places that want to use AIO shouldn't themselves need to care about
> managing the number of writes in flight, or the readahead distance. To help
> with that there are two helper utilities, a "streaming read" and a "streaming
> write".
>
> The "streaming read" helper uses a callback to determine which blocks to
> prefetch - that allows to do readahead in a sequential fashion but importantly
> also allows to asynchronously "read ahead" non-sequential blocks.
>
> E.g. for vacuum, lazy_scan_heap() has a callback that uses the visibility map
> to figure out which block needs to be read next. Similarly lazy_vacuum_heap()
> uses the tids in LVDeadTuples to figure out which blocks are going to be
> needed. Here's the latter as an example:
>
https://github.com/anarazel/postgres/commit/a244baa36bfb252d451a017a273a6da1c09f15a3#diff-3198152613d9a28963266427b380e3d4fbbfabe96a221039c6b1f37bc575b965R1906
>

Attached is a patch on top of the AIO branch which does bitmapheapscan
prefetching using the PgStreamingRead helper already used by sequential
scan and vacuum on the AIO branch.

The prefetch iterator is removed and the main iterator in the
BitmapHeapScanState node is now used by the PgStreamingRead helper.

Some notes about the code:

Each IO will have its own TBMIterateResult allocated and returned by the
PgStreamingRead helper and freed later by
heapam_scan_bitmap_next_block() before requesting the next block.
Previously it was allocated once and saved in the TBMIterator in the
BitmapHeapScanState node and reused. Because of this, the table AM API
routine, table_scan_bitmap_next_block() now defines the TBMIterateResult
as an output parameter.

The PgStreamingRead helper pgsr_private parameter for BitmapHeapScan is
now the actual BitmapHeapScanState node. It needed access to the
iterator, the heap scan descriptor, and a few fields in the
BitmapHeapScanState node that could be moved elsewhere or duplicated
(visibility map buffer and can_skip_fetch, for example). So, it is
possible to either create a new struct or move fields around to avoid
this--but, I'm not sure if that would actually be better.

Because the PgStreamingReadHelper needs to be set up with the
BitmapHeapScanState node but also needs some table AM specific
functions, I thought it made more sense to initialize it using a new
table AM API routine. Instead of fully implementing that I just wrote a
wrapper function, table_bitmap_scan_setup() which just calls
bitmapheap_pgsr_alloc() to socialize the idea before implementing it.

I haven't made the GIN code reasonable yet either (it uses the TID
bitmap functions that I've changed).

There are various TODOs in the code posing questions both to the
reviewer and myself for future versions of the patch.

Oh, also, I haven't updated the failing partition_prune regression test
because I haven't had a chance to look at the EXPLAIN code which adds
the text which is not being produced to see if it is actually a bug in
my code or not.

Oh, and I haven't done testing to see how effective the prefetching is
-- that is a larger project that I have yet to tackle.

- Melanie

Attachment

Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Andres Freund
Date:
Hi,

On 2021-07-28 13:37:48 -0400, Melanie Plageman wrote:
> Attached is a patch on top of the AIO branch which does bitmapheapscan
> prefetching using the PgStreamingRead helper already used by sequential
> scan and vacuum on the AIO branch.
>
> The prefetch iterator is removed and the main iterator in the
> BitmapHeapScanState node is now used by the PgStreamingRead helper.

Cool! I'm heartened to see "12 files changed, 272 insertions(+), 495 deletions(-)"


It's worth calling out that this fixes some abstraction leakiness around
tableam too...


> Each IO will have its own TBMIterateResult allocated and returned by the
> PgStreamingRead helper and freed later by
> heapam_scan_bitmap_next_block() before requesting the next block.
> Previously it was allocated once and saved in the TBMIterator in the
> BitmapHeapScanState node and reused. Because of this, the table AM API
> routine, table_scan_bitmap_next_block() now defines the TBMIterateResult
> as an output parameter.
>
> I haven't made the GIN code reasonable yet either (it uses the TID
> bitmap functions that I've changed).

I don't quite understand the need to change the tidbitmap interface, or
maybe rather I'm not convinced that pessimistically preallocating space
is a good idea?


> Because the PgStreamingReadHelper needs to be set up with the
> BitmapHeapScanState node but also needs some table AM specific
> functions, I thought it made more sense to initialize it using a new
> table AM API routine. Instead of fully implementing that I just wrote a
> wrapper function, table_bitmap_scan_setup() which just calls
> bitmapheap_pgsr_alloc() to socialize the idea before implementing it.

That makes sense.


>  static bool
>  heapam_scan_bitmap_next_block(TableScanDesc scan,
> -                              TBMIterateResult *tbmres)
> +                              TBMIterateResult **tbmres)

ISTM that we possibly shouldn't expose the TBMIterateResult outside of
the AM after this change? It feels somewhat like an implementation
detail now. It seems somewhat odd to expose a ** to set a pointer that
nodeBitmapHeapscan.c then doesn't really deal with itself.


> @@ -695,8 +693,7 @@ tbm_begin_iterate(TIDBitmap *tbm)
>       * Create the TBMIterator struct, with enough trailing space to serve the
>       * needs of the TBMIterateResult sub-struct.
>       */
> -    iterator = (TBMIterator *) palloc(sizeof(TBMIterator) +
> -                                      MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
> +    iterator = (TBMIterator *) palloc(sizeof(TBMIterator));
>      iterator->tbm = tbm;

Hm?


> diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
> index 9a07f06b9f..8e1aa48827 100644
> --- a/src/include/storage/aio.h
> +++ b/src/include/storage/aio.h
> @@ -39,7 +39,7 @@ typedef enum IoMethod
>  } IoMethod;
>
>  /* We'll default to bgworker. */
> -#define DEFAULT_IO_METHOD IOMETHOD_WORKER
> +#define DEFAULT_IO_METHOD IOMETHOD_IO_URING

I agree with the sentiment, but ... :)

Greetings,

Andres Freund



Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Melanie Plageman
Date:

On Wed, Jul 28, 2021 at 2:10 PM Andres Freund <andres@anarazel.de> wrote:
> On 2021-07-28 13:37:48 -0400, Melanie Plageman wrote:
>
> > Each IO will have its own TBMIterateResult allocated and returned by the
> > PgStreamingRead helper and freed later by
> > heapam_scan_bitmap_next_block() before requesting the next block.
> > Previously it was allocated once and saved in the TBMIterator in the
> > BitmapHeapScanState node and reused. Because of this, the table AM API
> > routine, table_scan_bitmap_next_block() now defines the TBMIterateResult
> > as an output parameter.
> >
> > I haven't made the GIN code reasonable yet either (it uses the TID
> > bitmap functions that I've changed).
>
> I don't quite understand the need to change the tidbitmap interface, or
> maybe rather I'm not convinced that pessimistically preallocating space
> is a good idea?
>

TBMIterator cannot contain a TBMIterateResult because it prefetches
blocks and calls tbm_iterate() for each one, which would overwrite the
relevant information in the TBMIterateResult before it has been returned
to heapam_scan_bitmap_next_block().*

Thus, we need at least as many TBMIterateResults as the size of the
prefetch window at its largest.

We could save some memory if we separated the data in TBMIterateResult
and made a new struct, let's call it BitmapBlockState, with just the
block number, buffer number, and recheck to be used and returned by
bitmapheapscan_pgsr_next_single().

We need both block and buffer because we need to distinguish between
hit_end, skip_fetch, and invalid block number conditions in the caller.
We need recheck before initiating IO to determine if we should
skip_fetch.

Then a separate struct which is much the same as the existing
TBMIterateResult could be maintained in the BitmapHeapScanState node and
passed into heapam_scan_bitmap_next_block() along with the bitmap (a new
parameter).

In heapam_scan_bitmap_next_block(), after getting the BitmapBlockState
from pg_streaming_read_get_next(), we could call tbm_find_pageentry()
with the block number and bitmap.
For a non-lossy page, we could then scrape the offsets and ntuples using
the PageTableEntry. If it is lossy, we would set recheck and ntuples
accordingly. (I do wonder if that allows us to distinguish between a
lossy page and a block number that is erroneous and isn't in the
bitmap--but maybe that can't happen.)

However, we would still have as many palloc() calls (one for every block)
to create the BitmapBlockState. We would have less outstanding memory by
limiting the number of offsets arrays created.
We would still need to pass the recheck flag, ntuples, and buffer back
up to BitmapHeapNext(), so, at that point we would still need a data
structure that is basically the same as the existing TBMIterateResult.

Alternatively, we could keep an array of TBMIterateResults the size of
the prefetch window and reuse them -- though I'm not sure where to keep
it and how to manage it when the window gets resized.

In my current patch, I allocate and free one TBMIterateResult for each
block. The amount of outstanding memory will be #ios in prefetch window
* sizeof(TBMIterateResult).

We don't want to always palloc() memory for the TBMIterateResult inside
of tbm_iterate(), since other users (like GIN) still only need one
TBMIterateResult.

So, if the TBMIterateResult is not inside of the TBMIterator and
tbm_iterate() does not allocate the memory, we need to pass it in as an
output parameter, and, if we do that, it felt odd to also return it --
hence the function signature change.

One alternative I tried was having the TBMIterator have a pointer to the
TBMIterateResult and then users of it can allocate the TBMIterateResult
and set it in the TBMIterator before calling tbm_iterate(). But, then we
need to expose the TBMIterator outside of the TID bitmap API. Also, it
felt weird to have a member of the iterator which must not be NULL when
tbm_iterate() is called but which isn't set up in tbm_begin_iterate().

>
>
> >  static bool
> >  heapam_scan_bitmap_next_block(TableScanDesc scan,
> > -                                                       TBMIterateResult *tbmres)
> > +                              TBMIterateResult **tbmres)
>
> ISTM that we possibly shouldn't expose the TBMIterateResult outside of
> the AM after this change? It feels somewhat like an implementation
> detail now. It seems somewhat odd to expose a ** to set a pointer that
> nodeBitmapHeapscan.c then doesn't really deal with itself.
>

All the members of the TBMIterateResult are populated in
bitmapheapscan_pgsr_next_single() and then
most of it is used by heapam_scan_bitmap_next_block() to
  - detect error conditions and done-ness
  - fill in the HeapScanDesc with the information needed by
    heapam_scan_bitmap_next_tuple() (rs_cbuf [which is basically
    redundant with TBMIterateResult->buffer] and rs_vistuples)

However, some of the information is used up in BitmapHeapNext() and in
heapam_scan_bitmap_next_tuple() and doesn't go in the HeapScanDesc:
  - BitmapHeapNext() uses the state of the TBMIterateResult to determine
    if the bitmap is exhausted, since the return value of
    table_scan_bitmap_next_block() indicates an error condition and not
    done-ness
  - BitmapHeapNext() uses recheck to determine whether or not to
    recheck qual conditions
  - heapam_scan_bitmap_next_tuple() uses the validity of the buffer to
    determine if it should return empty tuples
  - heapam_scan_bitmap_next_tuple() uses ntuples to determine how many
    empty tuples to return

So, if we don't want to pass around a TBMIterateResult, we would have to
1) change the return value of heapam_scan_bitmap_next_block() and 2)
find another appropriate place for the information above (or another way
to represent the encoded information).

It is also worth noting that heapam_scan_bitmap_next_tuple() took a
TBMIterateResult before without using it, so I assume you foresaw other
table AMs using it?

Overall, the whole thing still feels a bit yucky to me. It doesn't quite
feel like the right things are in the right places, but, I haven't put
my finger on the culprit.

I do think putting the buffer in the TBMIterateResult is an
inappropriate addition to the TID Bitmap API.

Also, I would like to move this code:

    if (node->tbmres->ntuples >= 0)
        node->exact_pages++;
    else
        node->lossy_pages++;

from where it is in BitmapHeapNext(). It seems odd that that is the only
part of BitmapHeapNext() that reaches inside of the TBMIterateResult.
Also, as it is, it is incorrect--it doesn't count the first page. I
could duplicate it under the first call to
table_scan_bitmap_next_block(), but I wasn't looking forward to doing
so.

>
> > @@ -695,8 +693,7 @@ tbm_begin_iterate(TIDBitmap *tbm)
> >        * Create the TBMIterator struct, with enough trailing space to serve the
> >        * needs of the TBMIterateResult sub-struct.
> >        */
> > -     iterator = (TBMIterator *) palloc(sizeof(TBMIterator) +
> > -                                                                       MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
> > +     iterator = (TBMIterator *) palloc(sizeof(TBMIterator));
> >       iterator->tbm = tbm;
>
> Hm?
>

I removed the TBMIterateResult from the TBMIterator, so we should no
longer allocate memory for the offsets array when creating the
TBMIterator.

* I think that having TBMIterateResult inside of TBMIterator is not
  well-defined C language behavior. In [1], it says

  "Structures with flexible array members (or unions who have a
  recursive-possibly structure member with flexible array member) cannot
  appear as array elements or as members of other structures."

[1] https://en.cppreference.com/w/c/language/struct

Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Andres Freund
Date:
Hi,

On 2021-07-30 15:35:30 -0400, Melanie Plageman wrote:
> * I think that having TBMIterateResult inside of TBMIterator is not
>   well-defined C language behavior. In [1], it says
> 
>   "Structures with flexible array members (or unions who have a
>   recursive-possibly structure member with flexible array member) cannot
>   appear as array elements or as members of other structures."

> [1] https://en.cppreference.com/w/c/language/struct

I think it is ok as long as the struct with the flexible array member is
at the end of the struct it is embedded in. I think even by the letter
of the standard, but it's as always hard to parse...

Greetings,

Andres Freund



Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Thomas Munro
Date:
On Sat, Jul 31, 2021 at 7:52 AM Andres Freund <andres@anarazel.de> wrote:
> On 2021-07-30 15:35:30 -0400, Melanie Plageman wrote:
> > * I think that having TBMIterateResult inside of TBMIterator is not
> >   well-defined C language behavior. In [1], it says
> >
> >   "Structures with flexible array members (or unions who have a
> >   recursive-possibly structure member with flexible array member) cannot
> >   appear as array elements or as members of other structures."
>
> > [1] https://en.cppreference.com/w/c/language/struct
>
> I think it is ok as long as the struct with the flexible array member is
> at the end of the struct it is embedded in. I think even by the letter
> of the standard, but it's as always hard to parse...

That's clearly the de facto situation (I think that was the case on
the most popular compilers long before flexible array members were
even standardised), but I think it might technically still be not
allowed since this change has not yet been accepted AFAICS:

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2083.htm

In any case, we already do it which is why wrasse (Sun Studio
compiler) warns about indkey in pg_index.h.  Curiously, indkey is not
always the final member of the containing struct, depending on
CATALOG_VARLEN...



Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> In any case, we already do it which is why wrasse (Sun Studio
> compiler) warns about indkey in pg_index.h.  Curiously, indkey is not
> always the final member of the containing struct, depending on
> CATALOG_VARLEN...

Hm?  CATALOG_VARLEN is never to be defined, see genbki.h.

            regards, tom lane



Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Melanie Plageman
Date:
On Wed, Jul 28, 2021 at 1:37 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Tue, Feb 23, 2021 at 5:04 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > ## AIO API overview
> >
> > The main steps to use AIO (without higher level helpers) are:
> >
> > 1) acquire an "unused" AIO: pgaio_io_get()
> >
> > 2) start some IO, this is done by functions like
> >    pgaio_io_start_(read|write|fsync|flush_range)_(smgr|sb|raw|wal)
> >
> >    The (read|write|fsync|flush_range) indicates the operation, whereas
> >    (smgr|sb|raw|wal) determines how IO completions, errors, ... are handled.
> >
> >    (see below for more details about this design choice - it might or not be
> >    right)
> >
> > 3) optionally: assign a backend-local completion callback to the IO
> >    (pgaio_io_on_completion_local())
> >
> > 4) 2) alone does *not* cause the IO to be submitted to the kernel, but to be
> >    put on a per-backend list of pending IOs. The pending IOs can be explicitly
> >    flushed with pgaio_submit_pending(), but will also be submitted if the
> >    pending list gets to be too large, or if the current backend waits for the
> >    IO.
> >
> >    There are two main reasons not to submit the IO immediately:
> >    - If adjacent, we can merge several IOs into one "kernel level" IO during
> >      submission. Larger IOs are considerably more efficient.
> >    - Several AIO APIs allow to submit a batch of IOs in one system call.
> >
> > 5) wait for the IO: pgaio_io_wait() waits for an IO "owned" by the current
> >    backend. When other backends may need to wait for an IO to finish,
> >    pgaio_io_ref() can put a reference to that AIO in shared memory (e.g. a
> >    BufferDesc), which can be waited for using pgaio_io_wait_ref().
> >
> > 6) Process the results of the request. If a callback was registered in 3),
> >    this isn't always necessary. The results of AIO can be accessed using
> >    pgaio_io_result() which returns an integer where negative numbers are
> >    -errno, and positive numbers are the [partial] success conditions
> >    (e.g. potentially indicating a short read).
> >
> > 7) release ownership of the io (pgaio_io_release()) or reuse the IO for
> >    another operation (pgaio_io_recycle())
> >
> >
> > Most places that want to use AIO shouldn't themselves need to care about
> > managing the number of writes in flight, or the readahead distance. To help
> > with that there are two helper utilities, a "streaming read" and a "streaming
> > write".
> >
> > The "streaming read" helper uses a callback to determine which blocks to
> > prefetch - that allows to do readahead in a sequential fashion but importantly
> > also allows to asynchronously "read ahead" non-sequential blocks.
> >
> > E.g. for vacuum, lazy_scan_heap() has a callback that uses the visibility map
> > to figure out which block needs to be read next. Similarly lazy_vacuum_heap()
> > uses the tids in LVDeadTuples to figure out which blocks are going to be
> > needed. Here's the latter as an example:
> >
https://github.com/anarazel/postgres/commit/a244baa36bfb252d451a017a273a6da1c09f15a3#diff-3198152613d9a28963266427b380e3d4fbbfabe96a221039c6b1f37bc575b965R1906
> >
>
> Attached is a patch on top of the AIO branch which does bitmapheapscan
> prefetching using the PgStreamingRead helper already used by sequential
> scan and vacuum on the AIO branch.
>
> The prefetch iterator is removed and the main iterator in the
> BitmapHeapScanState node is now used by the PgStreamingRead helper.
>
...
>
> Oh, and I haven't done testing to see how effective the prefetching is
> -- that is a larger project that I have yet to tackle.
>

I have done some testing on how effective it is now.

I've also updated the original patch to count the first page (in the
lossy/exact page counts mentioned down-thread) as well as to remove
unused prefetch fields and comments.
I've also included a second patch which adds IO wait time information to
EXPLAIN output when used like:
  EXPLAIN (buffers, analyze) SELECT ...

The same commit also introduces a temporary dev GUC
io_bitmap_prefetch_depth which I am using to experiment with the
prefetch window size.

I wanted to share some results from changing the prefetch window to
demonstrate how prefetching is working.

The short version of my results is that the prefetching works:

- with the prefetch window set to 1, the IO wait time is 1550 ms
- with the prefetch window set to 128, the IO wait time is 0.18 ms

DDL and repro details below:

On Andres' AIO branch [1] with my bitmap heapscan prefetching patch set
applied built with the following build flags:
-O2 -fno-omit-frame-pointer --with-liburing

And these non-default PostgreSQL settings:
  io_data_direct=1
  io_data_force_async=off
  io_method=io_uring
  log_min_duration_statement=0
  log_duration=on
  set track_io_timing to on;

  set max_parallel_workers_per_gather to 0;
  set enable_seqscan to off;
  set enable_indexscan to off;
  set enable_bitmapscan to on;

  set effective_io_concurrency to 128;
  set io_bitmap_prefetch_depth to 128;

Using this DDL:

drop table if exists bar;
create table bar(a int, b text, c text, d int);
create index bar_idx on bar(a);
insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,1000)i;
insert into bar select i%3, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,1000)i;
insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,200)i;
insert into bar select i%100, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,10000000)i;
insert into bar select i%2000, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,10000000)i;
insert into bar select i%10, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,10000000)i;
insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,10000000)i;
insert into bar select i%100, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,10000000)i;
insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,2000)i;
insert into bar select i%10, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,2000)i;
analyze;

And this query:

select * from bar where a > 100 offset 10000000000000;

with the prefetch window set to 1,
the query execution time is:
5496.129 ms

and IO wait time is:
1550.915 ms

mplageman=# explain (buffers, analyze, timing off) select * from bar where a > 100 offset 10000000000000;
                                                      QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1462959.87..1462959.87 rows=1 width=68) (actual rows=0 loops=1)
   Buffers: shared hit=1 read=280571
   I/O Timings: read=1315.845 wait=1550.915
   ->  Bitmap Heap Scan on bar  (cost=240521.25..1462959.87 rows=19270298 width=68) (actual rows=19497800 loops=1)
         Recheck Cond: (a > 100)
         Rows Removed by Index Recheck: 400281
         Heap Blocks: exact=47915 lossy=197741
         Buffers: shared hit=1 read=280571
         I/O Timings: read=1315.845 wait=1550.915
         ->  Bitmap Index Scan on bar_idx  (cost=0.00..235703.67 rows=19270298 width=0) (actual rows=19497800 loops=1)
               Index Cond: (a > 100)
               Buffers: shared hit=1 read=34915
               I/O Timings: read=1315.845
 Planning:
   Buffers: shared hit=96 read=30
   I/O Timings: read=3.399
 Planning Time: 4.378 ms
 Execution Time: 5473.404 ms
(18 rows)

with the prefetch window set to 128,
the query execution time is:
3222 ms

and IO wait time is:
0.178 ms

mplageman=# explain (buffers, analyze, timing off) select * from bar where a > 100 offset 10000000000000;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1462959.87..1462959.87 rows=1 width=68) (actual rows=0 loops=1)
   Buffers: shared hit=1 read=280571
   I/O Timings: read=1339.795 wait=0.178
   ->  Bitmap Heap Scan on bar  (cost=240521.25..1462959.87 rows=19270298 width=68) (actual rows=19497800 loops=1)
         Recheck Cond: (a > 100)
         Rows Removed by Index Recheck: 400281
         Heap Blocks: exact=47915 lossy=197741
         Buffers: shared hit=1 read=280571
         I/O Timings: read=1339.795 wait=0.178
         ->  Bitmap Index Scan on bar_idx  (cost=0.00..235703.67 rows=19270298 width=0) (actual rows=19497800 loops=1)
               Index Cond: (a > 100)
               Buffers: shared hit=1 read=34915
               I/O Timings: read=1339.795
 Planning:
   Buffers: shared hit=96 read=30
   I/O Timings: read=3.488
 Planning Time: 4.279 ms
 Execution Time: 3434.522 ms
(18 rows)

- Melanie

[1] https://github.com/anarazel/postgres/tree/aio

Attachment

Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Andres Freund
Date:
Hi,

Attached is an updated patch AIO series. The major changes are:
- rebased onto master (Andres)
- lots of progress on posix AIO backend (Thomas)
- lots of progress towards a windows native AIO implementation - not yet quite
  merged (Thomas & David)
- considerably improved "worker" io_method (Thomas)
- some preliminary patches merged (Thomas) and thus dropped
- error handling overhaul, AIO references now use resource owners
- quite a few more localized bugfixes
- further CI improvements

Unfortunately there's a few tests that don't pass on windows. At least some of
those failures also happen on master - hence the alternative output file added
in the last commit.

Thanks to Thomas there's now a new wiki page for AIO support:
https://wiki.postgresql.org/wiki/AIO
It's currently mostly a shared todo list....

My own next steps are to try to get some of the preliminary patches merged
into master, and to address some of the higher level things that aren't yet
quite right with the AIO interface, and to split the "main" AIO patch into
smaller patches.

I hope that we soon send in a new version with native AIO support for
windows. I'm mostly interested in that to make sure that we get the shared
infrastructure right.

Melanie has some work improving bitmap heap scan AIO support and some IO stats
/ explain improvements.

I think a decent and reasonably simple example for the way the AIO interface
can be used to do faster IO is
v3-0028-aio-Use-AIO-in-nbtree-vacuum-scan.patch.gz which adds AIO for nbtree
vacuum. It's not perfectly polished, but I think it shows that it's not too
hard to add AIO usage to individual subsystems once the general
infrastructure is in place.

I've attached the code for posterity, but the series is large enough that I
don't think it makes sense to do that all that often... The code is at
https://github.com/anarazel/postgres/tree/aio

Greetings,

Andres Freund

Attachment

Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Thomas Munro
Date:
On Wed, Sep 1, 2021 at 5:57 PM Andres Freund <andres@anarazel.de> wrote:
> - lots of progress on posix AIO backend (Thomas)

A quick note on this piece:  Though it's still a work in progress with
a few things that need to be improved, I've tested this on a whole lot
of different OSes now.  I originally tried to use realtime signals
(big mistake), but after a couple of reworks I think it's starting to
look plausible and quite portable.  Of the ~10 or so OSes we support
and test in the build farm, ~8 of them have this API, and of those I
have only one unknown: HPUX (I have no access and I am beginning to
suspect it is an ex-parrot), and one mysteriously-doesn't-work: NetBSD
(I'd be grateful for any clues from NetBSD gurus and happy to provide
build/test instructions if anyone would like to take a look).



Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Julien Rouhaud
Date:
Hi,

On Wed, Sep 1, 2021 at 1:57 PM Andres Freund <andres@anarazel.de> wrote:
>
> I've attached the code for posterity, but the series is large enough that I
> don't think it makes sense to do that all that often...

Agreed.

> The code is at
> https://github.com/anarazel/postgres/tree/aio

Just FYI the cfbot says that this version of the patchset doesn't
apply anymore, and it seems that your branch was only rebased to
43c1c4f (Sept. 21st) which doesn't rebase cleanly:

error: could not apply 8a20594f2f... lwlock, xlog: Report caller wait
event for LWLockWaitForVar.

Since it's still a WIP and a huge patchset I'm not sure if I should
switch the cf entry to Waiting on Author or not as it's probably going
to rot quite fast anyway.  Just to be safe I'll go ahead and change
the status.  If that's unhelpful just let me know and I'll switch it
back to needs review, as people motivated enough to review the patch
can still work with 43c1c4f as a starting point.



Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Aleksander Alekseev
Date:
Hi Andres,

> > The code is at
> > https://github.com/anarazel/postgres/tree/aio
>
> Just FYI the cfbot says that this version of the patchset doesn't
> apply anymore, and it seems that your branch was only rebased to
> 43c1c4f (Sept. 21th) which doesn't rebase cleanly:

After watching your recent talk "IO in PostgreSQL: Past, Present,
Future" [1] I decided to invest some of my time into this patchset. It
looks like at very least it could use a reviewer, or maybe two :)
Unfortunately, it's a bit difficult to work with the patchset at the
moment. Any chance we may expect a rebased version for the July CF?

> Comments? Questions?

Personally, I'm very enthusiastic about this patchset. However, a set
of 39 patches seems to be unrealistic to test and/or review and/or
keep up to date. The 64 bit XIDs patchset [2] is much less
complicated, but still it got the feedback that it should be split
into more patches and CF entries. Any chance we could decompose this
effort?

For instance, I doubt that we need all the backends in the first
implementation. The fallback "worker" one, and io_uring one will
suffice. Other backends can be added as separate features. Considering
that in any case the "worker" backend shouldn't cause any significant
performance degradation, maybe we could start even without io_uring.
BTW, do we need Posix AIO at all, given your feedback on this API?

Also, what if we migrate to AIO/DIO one part of the system at a time?
As I understood from your talk, sequential scans will benefit most
from AIO/DIO. Will it be possible to improve them first, while part of
the system will continue using buffered IO?

[1]: https://www.youtube.com/watch?v=3Oj7fBAqVTw
[2]: https://commitfest.postgresql.org/38/3594/


-- 
Best regards,
Aleksander Alekseev



Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Bharath Rupireddy
Date:
On Wed, Sep 1, 2021 at 11:27 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> Attached is an updated patch AIO series. The major changes are:

Hi Andres, is there a plan to get fallocate changes alone first? I think fallocate API can help parallel inserts work (bulk relation extension currently writes zero filled-pages) and make pre-padding while allocating WAL files faster.

Regards,
Bharath Rupireddy.

Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Michael Paquier
Date:
On Tue, Aug 31, 2021 at 10:56:59PM -0700, Andres Freund wrote:
> I've attached the code for posterity, but the series is large enough that I
> don't think it makes sense to do that all that often... The code is at
> https://github.com/anarazel/postgres/tree/aio

I don't know what's the exact status here, but as there has been no
activity for the past five months, I have just marked the entry as RwF
for now.
--
Michael

Attachment

Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Andres Freund
Date:
Hi,

On 2022-10-12 14:45:26 +0900, Michael Paquier wrote:
> On Tue, Aug 31, 2021 at 10:56:59PM -0700, Andres Freund wrote:
> > I've attached the code for posterity, but the series is large enough that I
> > don't think it makes sense to do that all that often... The code is at
> > https://github.com/anarazel/postgres/tree/aio
> 
> I don't know what's the exact status here, but as there has been no
> activity for the past five months, I have just marked the entry as RwF
> for now.

We're trying to get a number of smaller prerequisite patches merged this CF
(aligned alloc, direct IO, dclist, bulk relation extension, ...). Once
that's done I'm planning to send out a new version of the (large) remainder of
the changes.

Greetings,

Andres Freund



Re: Asynchronous and "direct" IO support for PostgreSQL.

From
Wenjing Zeng
Date:


On 2021-09-01 13:56, Andres Freund <andres@anarazel.de> wrote:

Hi,

Attached is an updated patch AIO series. The major changes are:
- rebased onto master (Andres)
- lots of progress on posix AIO backend (Thomas)
- lots of progress towards a windows native AIO implementation - not yet quite
 merged (Thomas & David)
- considerably improved "worker" io_method (Thomas)
- some preliminary patches merged (Thomas) and thus dropped
- error handling overhaul, AIO references now use resource owners
- quite a few more localized bugfixes
- further CI improvements

Unfortunately there's a few tests that don't pass on windows. At least some of
those failures also happen on master - hence the alternative output file added
in the last commit.

Thanks to Thomas there's now a new wiki page for AIO support:
https://wiki.postgresql.org/wiki/AIO
It's currently mostly a shared todo list....

My own next steps are to try to get some of the preliminary patches merged
into master, and to address some of the higher level things that aren't yet
quite right with the AIO interface, and to split the "main" AIO patch into
smaller patches.

I hope we can soon send in a new version with native AIO support for
Windows. I'm mostly interested in that to make sure that we get the shared
infrastructure right.

Melanie has some work improving bitmap heap scan AIO support and some IO stats
/ explain improvements.

I think a decent and reasonably simple example for the way the AIO interface
can be used to do faster IO is
v3-0028-aio-Use-AIO-in-nbtree-vacuum-scan.patch.gz which adds AIO for nbtree
vacuum. It's not perfectly polished, but I think it shows that it's not too
hard to add AIO usage to individual parts of the code once the general
infrastructure is in place.

I've attached the code for posterity, but the series is large enough that I
don't think it makes sense to do that all that often... The code is at
https://github.com/anarazel/postgres/tree/aio

Hi Andres,

I noticed this feature and did some testing.

In the code on GitHub's aio branch, the function

static void
pgaio_write_smgr_retry(PgAioInProgress *io)
{
    uint32      off;
    AioBufferTag *tag = &io->scb_data.write_smgr.tag;
    SMgrRelation reln = smgropen(tag->rlocator.locator, tag->rlocator.backend);

    io->op_data.read.fd = smgrfd(reln, tag->forkNum, tag->blockNum, &off);
    Assert(off == io->op_data.read.offset);
}


seems like it should instead be:

    io->op_data.write.fd = smgrfd(reln, tag->forkNum, tag->blockNum, &off);
    Assert(off == io->op_data.write.offset);



Best regards,

Wenjing



Greetings,

Andres Freund
Attachments:
  v3-0001-windows-Only-consider-us-to-be-running-as-service.patch.gz
  v3-0002-WIP-Fix-non-aio-bug-leading-to-checkpointer-not-s.patch.gz
  v3-0003-aio-WIP-align-PGAlignedBlock-to-page-size.patch.gz
  v3-0004-Add-allocator-support-for-larger-allocation-align.patch.gz
  v3-0005-ilist.h-debugging-improvements.patch.gz
  v3-0006-lwlock-xlog-Report-caller-wait-event-for-LWLockWa.patch.gz
  v3-0007-heapam-Don-t-re-inquire-block-number-for-each-tup.patch.gz
  v3-0008-Add-pg_prefetch_mem-macro-to-load-cache-lines.patch.gz
  v3-0009-heapam-WIP-cacheline-prefetching-for-hot-pruning.patch.gz
  v3-0010-WIP-Change-instr_time-to-just-store-nanoseconds-t.patch.gz
  v3-0011-aio-Add-some-error-checking-around-pinning.patch.gz
  v3-0012-aio-allow-lwlocks-to-be-unowned.patch.gz
  v3-0013-condvar-add-ConditionVariableCancelSleepEx.patch.gz
  v3-0014-Use-a-global-barrier-to-fix-DROP-TABLESPACE-on-Wi.patch.gz
  v3-0015-pg_buffercache-Add-pg_buffercache_stats.patch.gz
  v3-0016-bufmgr-Add-LockBufHdr-fastpath.patch.gz
  v3-0017-lwlock-WIP-add-extended-locking-functions.patch.gz
  v3-0018-io-Add-O_DIRECT-non-buffered-IO-mode.patch.gz
  v3-0019-io-Increase-default-ringbuffer-size.patch.gz
  v3-0020-Use-aux-process-resource-owner-in-walsender.patch.gz
  v3-0021-Ensure-a-resowner-exists-for-all-paths-that-may-p.patch.gz
  v3-0022-aio-Add-asynchronous-IO-infrastructure.patch.gz
  v3-0023-aio-Use-AIO-in-pg_prewarm.patch.gz
  v3-0024-aio-Use-AIO-in-bulk-relation-extension.patch.gz
  v3-0025-aio-Use-AIO-in-checkpointer-bgwriter.patch.gz
  v3-0026-aio-bitmap-heap-scan-Minimal-and-hacky-improvemen.patch.gz
  v3-0027-aio-Use-AIO-in-heap-vacuum-s-lazy_scan_heap-and-l.patch.gz
  v3-0028-aio-Use-AIO-in-nbtree-vacuum-scan.patch.gz
  v3-0029-aio-Use-AIO-for-heap-table-scans.patch.gz
  v3-0030-aio-Use-AIO-in-SyncDataDirectory.patch.gz
  v3-0031-aio-Use-AIO-in-ProcessSyncRequests.patch.gz
  v3-0032-aio-wal-concurrent-WAL-flushes.patch.gz
  v3-0033-wal-Use-LWLockAcquireOrWait-in-AdvanceXLInsertBuf.patch.gz
  v3-0034-wip-wal-async-commit-reduce-frequency-of-latch-se.patch.gz
  v3-0035-aio-wal-padding-of-partial-records.patch.gz
  v3-0036-aio-wal-extend-pg_stat_wal.patch.gz
  v3-0037-aio-initial-sketch-for-design-document.patch.gz
  v3-0038-aio-CI-and-README.md.patch.gz
  v3-0039-XXX-Add-temporary-workaround-for-partition_prune-.patch.gz