Re: Asynchronous and "direct" IO support for PostgreSQL. - Mailing list pgsql-hackers
From:           Melanie Plageman
Subject:        Re: Asynchronous and "direct" IO support for PostgreSQL.
Date:
Msg-id:         CAAKRu_a4CiCH+Rxw39LVoCJMRP9Ns1rg+_vyQz4rhrMmLHZAvQ@mail.gmail.com
In response to: Asynchronous and "direct" IO support for PostgreSQL. (Andres Freund <andres@anarazel.de>)
Responses:      Re: Asynchronous and "direct" IO support for PostgreSQL.
                Re: Asynchronous and "direct" IO support for PostgreSQL.
List:           pgsql-hackers
On Tue, Feb 23, 2021 at 5:04 AM Andres Freund <andres@anarazel.de> wrote:
>
> ## AIO API overview
>
> The main steps to use AIO (without higher level helpers) are:
>
> 1) Acquire an "unused" AIO: pgaio_io_get()
>
> 2) Start some IO. This is done by functions like
>    pgaio_io_start_(read|write|fsync|flush_range)_(smgr|sb|raw|wal)
>
>    The (read|write|fsync|flush_range) part indicates the operation, whereas
>    (smgr|sb|raw|wal) determines how IO completions, errors, ... are handled.
>
>    (See below for more details about this design choice - it might or might
>    not be right.)
>
> 3) Optionally: assign a backend-local completion callback to the IO
>    (pgaio_io_on_completion_local()).
>
> 4) Step 2) alone does *not* cause the IO to be submitted to the kernel; it is
>    put on a per-backend list of pending IOs. The pending IOs can be explicitly
>    flushed with pgaio_submit_pending(), but they will also be submitted if the
>    pending list gets too large, or if the current backend waits for the IO.
>
>    There are two main reasons not to submit the IO immediately:
>    - If adjacent, we can merge several IOs into one "kernel level" IO during
>      submission. Larger IOs are considerably more efficient.
>    - Several AIO APIs allow submitting a batch of IOs in one system call.
>
> 5) Wait for the IO: pgaio_io_wait() waits for an IO "owned" by the current
>    backend. When other backends may need to wait for an IO to finish,
>    pgaio_io_ref() can put a reference to that AIO in shared memory (e.g. a
>    BufferDesc), which can be waited for using pgaio_io_wait_ref().
>
> 6) Process the results of the request. If a callback was registered in 3),
>    this isn't always necessary. The result of an AIO can be accessed using
>    pgaio_io_result(), which returns an integer where negative numbers are
>    -errno and positive numbers are the [partial] success conditions
>    (e.g. potentially indicating a short read).
>
> 7) Release ownership of the IO (pgaio_io_release()) or reuse it for another
>    operation (pgaio_io_recycle()).
>
> Most places that want to use AIO shouldn't themselves need to care about
> managing the number of writes in flight or the readahead distance. To help
> with that there are two helper utilities, a "streaming read" and a
> "streaming write".
>
> The "streaming read" helper uses a callback to determine which blocks to
> prefetch - that allows readahead in a sequential fashion but, importantly,
> also allows asynchronously "reading ahead" non-sequential blocks.
>
> E.g. for vacuum, lazy_scan_heap() has a callback that uses the visibility map
> to figure out which block needs to be read next. Similarly, lazy_vacuum_heap()
> uses the tids in LVDeadTuples to figure out which blocks are going to be
> needed. Here's the latter as an example:
> https://github.com/anarazel/postgres/commit/a244baa36bfb252d451a017a273a6da1c09f15a3#diff-3198152613d9a28963266427b380e3d4fbbfabe96a221039c6b1f37bc575b965R1906

Attached is a patch on top of the AIO branch which does bitmap heap scan prefetching using the PgStreamingRead helper already used by sequential scan and vacuum on the AIO branch. The prefetch iterator is removed, and the main iterator in the BitmapHeapScanState node is now used by the PgStreamingRead helper.

Some notes about the code:

Each IO will have its own TBMIterateResult, allocated and returned by the PgStreamingRead helper and freed later by heapam_scan_bitmap_next_block() before requesting the next block. Previously it was allocated once, saved in the TBMIterator in the BitmapHeapScanState node, and reused.
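Roughly, the consumer side now looks like the sketch below. This is only meant to illustrate the lifecycle: pg_streaming_read_get_next() is my shorthand for the helper's "give me the next completed block" call, and the rs_pgsr field and the exact signatures are not necessarily what is in the patch.

/*
 * Illustrative sketch only: each call frees the TBMIterateResult for the
 * block it has finished with and fetches the next one from the streaming
 * read helper.  pg_streaming_read_get_next() and rs_pgsr are assumed names,
 * not necessarily the actual AIO-branch API.
 */
static bool
heapam_scan_bitmap_next_block(TableScanDesc scan, TBMIterateResult **tbmres)
{
    HeapScanDesc hscan = (HeapScanDesc) scan;

    /* Done with the previous block's iterate result, if any. */
    if (*tbmres != NULL)
    {
        pfree(*tbmres);
        *tbmres = NULL;
    }

    /* Ask the helper for the next block whose IO has been started for us. */
    *tbmres = (TBMIterateResult *) pg_streaming_read_get_next(hscan->rs_pgsr);

    /* NULL means the shared iterator is exhausted. */
    if (*tbmres == NULL)
        return false;

    /* ... then continue as before: pin the buffer, collect visible tuples ... */
    return true;
}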
Because each IO now has its own TBMIterateResult, the table AM API routine table_scan_bitmap_next_block() now defines the TBMIterateResult as an output parameter.

The PgStreamingRead helper's pgsr_private parameter for BitmapHeapScan is now the actual BitmapHeapScanState node. The helper needs access to the iterator, the heap scan descriptor, and a few fields in the BitmapHeapScanState node that could be moved elsewhere or duplicated (the visibility map buffer and can_skip_fetch, for example). So it would be possible to create a new struct or move fields around to avoid this -- but I'm not sure that would actually be better.

Because the PgStreamingRead helper needs to be set up with the BitmapHeapScanState node but also needs some table-AM-specific functions, I thought it made more sense to initialize it using a new table AM API routine. Instead of fully implementing that, I just wrote a wrapper function, table_bitmap_scan_setup(), which simply calls bitmapheap_pgsr_alloc(), to socialize the idea before implementing it (a rough sketch is in the P.S. below).

I haven't made the GIN code reasonable yet either (it uses the TID bitmap functions that I've changed). There are various TODOs in the code posing questions to both the reviewer and myself for future versions of the patch.

Also, I haven't updated the failing partition_prune regression test: I haven't yet had a chance to look at the EXPLAIN code that adds the text which is no longer being produced, so I don't know whether it is actually a bug in my code.

Finally, I haven't done any testing of how effective the prefetching is -- that is a larger project I have yet to tackle.

- Melanie
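P.S. For concreteness, the table AM setup routine I have in mind is roughly the sketch below. table_bitmap_scan_setup() and bitmapheap_pgsr_alloc() are the names in the attached patch, but the signature and comments here are only an illustration of the idea, not the final API.

/*
 * Sketch of the proposed table AM entry point for setting up the streaming
 * read helper of a bitmap heap scan.  In the patch this is just a wrapper
 * function, not yet a real table AM callback.
 */
static inline void
table_bitmap_scan_setup(BitmapHeapScanState *scanstate)
{
    /*
     * The whole BitmapHeapScanState is used as the helper's pgsr_private so
     * that the block-selection callback can reach the shared iterator, the
     * heap scan descriptor, the visibility map buffer, and can_skip_fetch.
     */
    bitmapheap_pgsr_alloc(scanstate);
}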