Re: Asynchronous and "direct" IO support for PostgreSQL. - Mailing list pgsql-hackers
From:           Melanie Plageman
Subject:        Re: Asynchronous and "direct" IO support for PostgreSQL.
Date:
Msg-id:         CAAKRu_a4CiCH+Rxw39LVoCJMRP9Ns1rg+_vyQz4rhrMmLHZAvQ@mail.gmail.com
In response to: Asynchronous and "direct" IO support for PostgreSQL. (Andres Freund <andres@anarazel.de>)
Responses:      Re: Asynchronous and "direct" IO support for PostgreSQL.
                Re: Asynchronous and "direct" IO support for PostgreSQL.
List:           pgsql-hackers
On Tue, Feb 23, 2021 at 5:04 AM Andres Freund <andres@anarazel.de> wrote:
>
> ## AIO API overview
>
> The main steps to use AIO (without higher level helpers) are:
>
> 1) Acquire an "unused" AIO: pgaio_io_get()
>
> 2) Start some IO. This is done by functions like
>    pgaio_io_start_(read|write|fsync|flush_range)_(smgr|sb|raw|wal)
>
>    The (read|write|fsync|flush_range) part indicates the operation, whereas
>    (smgr|sb|raw|wal) determines how IO completions, errors, ... are handled.
>
>    (See below for more details about this design choice - it might or might
>    not be right.)
>
> 3) Optionally: assign a backend-local completion callback to the IO
>    (pgaio_io_on_completion_local()).
>
> 4) Step 2) alone does *not* cause the IO to be submitted to the kernel; it is
>    put on a per-backend list of pending IOs. The pending IOs can be explicitly
>    flushed with pgaio_submit_pending(), but they will also be submitted if the
>    pending list gets too large, or if the current backend waits for the IO.
>
>    There are two main reasons not to submit the IO immediately:
>    - If adjacent, we can merge several IOs into one "kernel level" IO during
>      submission. Larger IOs are considerably more efficient.
>    - Several AIO APIs allow submitting a batch of IOs in one system call.
>
> 5) Wait for the IO: pgaio_io_wait() waits for an IO "owned" by the current
>    backend. When other backends may need to wait for an IO to finish,
>    pgaio_io_ref() can put a reference to that AIO in shared memory (e.g. a
>    BufferDesc), which can be waited for using pgaio_io_wait_ref().
>
> 6) Process the results of the request. If a callback was registered in 3),
>    this isn't always necessary. The result of an AIO can be accessed using
>    pgaio_io_result(), which returns an integer where negative numbers are
>    -errno and positive numbers are the [partial] success conditions
>    (e.g. potentially indicating a short read).
>
> 7) Release ownership of the IO (pgaio_io_release()) or reuse it for another
>    operation (pgaio_io_recycle()).
>
> Most places that want to use AIO shouldn't themselves need to care about
> managing the number of writes in flight or the readahead distance. To help
> with that there are two helper utilities, a "streaming read" and a
> "streaming write".
>
> The "streaming read" helper uses a callback to determine which blocks to
> prefetch - that allows readahead in a sequential fashion but, importantly,
> also allows asynchronously "reading ahead" non-sequential blocks.
>
> E.g. for vacuum, lazy_scan_heap() has a callback that uses the visibility map
> to figure out which block needs to be read next. Similarly, lazy_vacuum_heap()
> uses the tids in LVDeadTuples to figure out which blocks are going to be
> needed. Here's the latter as an example:
> https://github.com/anarazel/postgres/commit/a244baa36bfb252d451a017a273a6da1c09f15a3#diff-3198152613d9a28963266427b380e3d4fbbfabe96a221039c6b1f37bc575b965R1906

Attached is a patch on top of the AIO branch which does bitmap heap scan prefetching using the PgStreamingRead helper already used by sequential scan and vacuum on the AIO branch. The prefetch iterator is removed, and the main iterator in the BitmapHeapScanState node is now used by the PgStreamingRead helper.

Some notes about the code:

Each IO will have its own TBMIterateResult, allocated and returned by the PgStreamingRead helper and freed later by heapam_scan_bitmap_next_block() before requesting the next block. Previously it was allocated once, saved in the TBMIterator in the BitmapHeapScanState node, and reused.
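Roughly, the consumer side now looks like the sketch below. This is only meant to illustrate the lifecycle: pg_streaming_read_get_next() is my shorthand for the helper's "give me the next completed block" call, and the rs_pgsr field and the exact signatures are not necessarily what is in the patch.

/*
 * Illustrative sketch only: each call frees the TBMIterateResult for the
 * block it has finished with and fetches the next one from the streaming
 * read helper.  pg_streaming_read_get_next() and rs_pgsr are assumed names,
 * not necessarily the actual AIO-branch API.
 */
static bool
heapam_scan_bitmap_next_block(TableScanDesc scan, TBMIterateResult **tbmres)
{
    HeapScanDesc hscan = (HeapScanDesc) scan;

    /* Done with the previous block's iterate result, if any. */
    if (*tbmres != NULL)
    {
        pfree(*tbmres);
        *tbmres = NULL;
    }

    /* Ask the helper for the next block whose IO has been started for us. */
    *tbmres = (TBMIterateResult *) pg_streaming_read_get_next(hscan->rs_pgsr);

    /* NULL means the shared iterator is exhausted. */
    if (*tbmres == NULL)
        return false;

    /* ... then continue as before: pin the buffer, collect visible tuples ... */
    return true;
}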
Because each IO now has its own TBMIterateResult, the table AM API routine table_scan_bitmap_next_block() now defines the TBMIterateResult as an output parameter.

The PgStreamingRead helper's pgsr_private parameter for BitmapHeapScan is now the actual BitmapHeapScanState node. The helper needs access to the iterator, the heap scan descriptor, and a few fields in the BitmapHeapScanState node that could be moved elsewhere or duplicated (the visibility map buffer and can_skip_fetch, for example). So it would be possible to create a new struct or move fields around to avoid this -- but I'm not sure that would actually be better.

Because the PgStreamingRead helper needs to be set up with the BitmapHeapScanState node but also needs some table-AM-specific functions, I thought it made more sense to initialize it using a new table AM API routine. Instead of fully implementing that, I just wrote a wrapper function, table_bitmap_scan_setup(), which simply calls bitmapheap_pgsr_alloc(), to socialize the idea before implementing it (a rough sketch is in the P.S. below).

I haven't made the GIN code reasonable yet either (it uses the TID bitmap functions that I've changed). There are various TODOs in the code posing questions to both the reviewer and myself for future versions of the patch.

Also, I haven't updated the failing partition_prune regression test: I haven't yet had a chance to look at the EXPLAIN code that adds the text which is no longer being produced, so I don't know whether it is actually a bug in my code.

Finally, I haven't done any testing of how effective the prefetching is -- that is a larger project I have yet to tackle.

- Melanie
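P.S. For concreteness, the table AM setup routine I have in mind is roughly the sketch below. table_bitmap_scan_setup() and bitmapheap_pgsr_alloc() are the names in the attached patch, but the signature and comments here are only an illustration of the idea, not the final API.

/*
 * Sketch of the proposed table AM entry point for setting up the streaming
 * read helper of a bitmap heap scan.  In the patch this is just a wrapper
 * function, not yet a real table AM callback.
 */
static inline void
table_bitmap_scan_setup(BitmapHeapScanState *scanstate)
{
    /*
     * The whole BitmapHeapScanState is used as the helper's pgsr_private so
     * that the block-selection callback can reach the shared iterator, the
     * heap scan descriptor, the visibility map buffer, and can_skip_fetch.
     */
    bitmapheap_pgsr_alloc(scanstate);
}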