Re: Asynchronous and "direct" IO support for PostgreSQL. - Mailing list pgsql-hackers
From: Melanie Plageman
Subject: Re: Asynchronous and "direct" IO support for PostgreSQL.
Date:
Msg-id: CAAKRu_bdJQjAddtsWJrrcB8rx60wA6E8y5AcLNwQ9RLsvjqqXQ@mail.gmail.com
In response to: Re: Asynchronous and "direct" IO support for PostgreSQL. (Melanie Plageman <melanieplageman@gmail.com>)
List: pgsql-hackers
On Wed, Jul 28, 2021 at 1:37 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Tue, Feb 23, 2021 at 5:04 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > ## AIO API overview
> >
> > The main steps to use AIO (without higher level helpers) are:
> >
> > 1) acquire an "unused" AIO: pgaio_io_get()
> >
> > 2) start some IO. This is done by functions like
> >    pgaio_io_start_(read|write|fsync|flush_range)_(smgr|sb|raw|wal)
> >
> >    The (read|write|fsync|flush_range) part indicates the operation, whereas
> >    (smgr|sb|raw|wal) determines how IO completions, errors, ... are handled.
> >
> >    (see below for more details about this design choice - it might or might
> >    not be right)
> >
> > 3) optionally: assign a backend-local completion callback to the IO
> >    (pgaio_io_on_completion_local())
> >
> > 4) Step 2) alone does *not* cause the IO to be submitted to the kernel;
> >    instead, the IO is put on a per-backend list of pending IOs. The pending
> >    IOs can be explicitly flushed with pgaio_submit_pending(), but will also
> >    be submitted if the pending list grows too large, or if the current
> >    backend waits for the IO.
> >
> >    There are two main reasons not to submit the IO immediately:
> >    - If adjacent, we can merge several IOs into one "kernel level" IO during
> >      submission. Larger IOs are considerably more efficient.
> >    - Several AIO APIs allow submitting a batch of IOs in one system call.
> >
> > 5) wait for the IO: pgaio_io_wait() waits for an IO "owned" by the current
> >    backend. When other backends may need to wait for an IO to finish,
> >    pgaio_io_ref() can put a reference to that AIO in shared memory (e.g. a
> >    BufferDesc), which can be waited for using pgaio_io_wait_ref().
> >
> > 6) process the results of the request. If a callback was registered in 3),
> >    this isn't always necessary. The results of AIO can be accessed using
> >    pgaio_io_result(), which returns an integer where negative numbers are
> >    -errno, and positive numbers are the [partial] success conditions
> >    (e.g. potentially indicating a short read).
> >
> > 7) release ownership of the IO (pgaio_io_release()) or reuse the IO for
> >    another operation (pgaio_io_recycle())
> >
> >
> > Most places that want to use AIO shouldn't themselves need to care about
> > managing the number of writes in flight, or the readahead distance. To help
> > with that there are two helper utilities, a "streaming read" and a
> > "streaming write".
> >
> > The "streaming read" helper uses a callback to determine which blocks to
> > prefetch - that allows doing readahead in a sequential fashion but,
> > importantly, also allows asynchronously "reading ahead" non-sequential
> > blocks.
> >
> > E.g. for vacuum, lazy_scan_heap() has a callback that uses the visibility
> > map to figure out which block needs to be read next. Similarly,
> > lazy_vacuum_heap() uses the tids in LVDeadTuples to figure out which blocks
> > are going to be needed. Here's the latter as an example:
> > https://github.com/anarazel/postgres/commit/a244baa36bfb252d451a017a273a6da1c09f15a3#diff-3198152613d9a28963266427b380e3d4fbbfabe96a221039c6b1f37bc575b965R1906
> >
>
> Attached is a patch on top of the AIO branch which does bitmapheapscan
> prefetching using the PgStreamingRead helper already used by sequential
> scan and vacuum on the AIO branch.
>
> The prefetch iterator is removed, and the main iterator in the
> BitmapHeapScanState node is now used by the PgStreamingRead helper.
>
> ...
>
> Oh, and I haven't done testing to see how effective the prefetching is
> -- that is a larger project that I have yet to tackle.
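To make the API steps quoted above concrete, here is a minimal sketch of
the low-level flow for a single read. The function names come from the
quoted overview; the PgAioInProgress type name, the header path, and the
argument lists are my guesses for illustration, not the actual signatures
on the AIO branch:

/*
 * Minimal sketch of the seven steps from the overview above. Types,
 * headers, and argument lists are assumptions for illustration only.
 */
#include "postgres.h"
#include "storage/aio.h"        /* assumed header on the AIO branch */
#include "storage/smgr.h"

static void
read_one_block(SMgrRelation reln, ForkNumber forknum,
               BlockNumber blocknum, char *buffer)
{
    PgAioInProgress *io;        /* type name is a guess */
    int         result;

    /* 1) acquire an "unused" AIO */
    io = pgaio_io_get();

    /* 2) queue a read; this only adds it to the per-backend pending list */
    pgaio_io_start_read_smgr(io, reln, forknum, blocknum, buffer);

    /* 4) explicitly submit pending IOs (adjacent IOs may get merged) */
    pgaio_submit_pending();

    /* 5) wait for the IO owned by this backend */
    pgaio_io_wait(io);

    /*
     * 6) negative results are -errno; positive values are the [partial]
     * success conditions, e.g. potentially a short read
     */
    result = pgaio_io_result(io);
    if (result < 0)
        elog(ERROR, "AIO read failed: %s", strerror(-result));

    /* 7) release ownership (or pgaio_io_recycle() to reuse the handle) */
    pgaio_io_release(io);
}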
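And, along the lines of the bitmap heap scan patch described above, a
sketch of the kind of next-block callback the PgStreamingRead helper
consumes. The callback signature and wiring are hypothetical; only
tbm_iterate(), TBMIterator, and TBMIterateResult are the existing
tidbitmap API:

#include "postgres.h"
#include "nodes/tidbitmap.h"

/*
 * Hypothetical next-block callback: the streaming read helper calls this
 * repeatedly to learn which block to read ahead of the consumer, so the
 * main bitmap iterator itself drives the prefetching.
 */
static BlockNumber
bhs_next_block(uintptr_t pgsr_private)
{
    TBMIterator *iterator = (TBMIterator *) pgsr_private;
    TBMIterateResult *tbmres = tbm_iterate(iterator);

    /* InvalidBlockNumber tells the helper there is nothing left to read */
    return tbmres != NULL ? tbmres->blockno : InvalidBlockNumber;
}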
I have done some testing on how effective it is now.

I've also updated the original patch to count the first page (in the
lossy/exact page counts mentioned down-thread) as well as to remove
unused prefetch fields and comments.

I've also included a second patch which adds IO wait time information to
EXPLAIN output when used like:

  EXPLAIN (buffers, analyze) SELECT ...

The same commit also introduces a temporary dev GUC,
io_bitmap_prefetch_depth, which I am using to experiment with the
prefetch window size.

I wanted to share some results from changing the prefetch window to
demonstrate how prefetching is working. The short version of my results
is that the prefetching works:

- with the prefetch window set to 1, the IO wait time is 1550 ms
- with the prefetch window set to 128, the IO wait time is 0.18 ms

DDL and repro details below.

On Andres' AIO branch [1], with my bitmap heapscan prefetching patch set
applied, built with the following flags:

  -O2 -fno-omit-frame-pointer --with-liburing

And these non-default PostgreSQL settings:

  io_data_direct=1
  io_data_force_async=off
  io_method=io_uring
  log_min_duration_statement=0
  log_duration=on

  set track_io_timing to on;
  set max_parallel_workers_per_gather to 0;
  set enable_seqscan to off;
  set enable_indexscan to off;
  set enable_bitmapscan to on;
  set effective_io_concurrency to 128;
  set io_bitmap_prefetch_depth to 128;

Using this DDL:

  drop table if exists bar;
  create table bar(a int, b text, c text, d int);
  create index bar_idx on bar(a);

  insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i
    from generate_series(1,1000)i;
  insert into bar select i%3, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i
    from generate_series(1,1000)i;
  insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i
    from generate_series(1,200)i;
  insert into bar select i%100, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i
    from generate_series(1,10000000)i;
  insert into bar select i%2000, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i
    from generate_series(1,10000000)i;
  insert into bar select i%10, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i
    from generate_series(1,10000000)i;
  insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i
    from generate_series(1,10000000)i;
  insert into bar select i%100, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i
    from generate_series(1,10000000)i;
  insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i
    from generate_series(1,2000)i;
  insert into bar select i%10, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i
    from generate_series(1,2000)i;
  analyze;

And this query:

  select * from bar where a > 100 offset 10000000000000;

With the prefetch window set to 1, the query execution time is 5496.129 ms
and the IO wait time is 1550.915 ms:

mplageman=# explain (buffers, analyze, timing off) select * from bar where a > 100 offset 10000000000000;
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1462959.87..1462959.87 rows=1 width=68) (actual rows=0 loops=1)
   Buffers: shared hit=1 read=280571
   I/O Timings: read=1315.845 wait=1550.915
   ->  Bitmap Heap Scan on bar  (cost=240521.25..1462959.87 rows=19270298 width=68) (actual rows=19497800 loops=1)
         Recheck Cond: (a > 100)
         Rows Removed by Index Recheck: 400281
         Heap Blocks: exact=47915 lossy=197741
         Buffers: shared hit=1 read=280571
         I/O Timings: read=1315.845 wait=1550.915
         ->  Bitmap Index Scan on bar_idx  (cost=0.00..235703.67 rows=19270298 width=0) (actual rows=19497800 loops=1)
               Index Cond: (a > 100)
               Buffers: shared hit=1 read=34915
               I/O Timings: read=1315.845
 Planning:
   Buffers: shared hit=96 read=30
   I/O Timings: read=3.399
 Planning Time: 4.378 ms
 Execution Time: 5473.404 ms
(18 rows)
With the prefetch window set to 128, the query execution time is 3222 ms
and the IO wait time is 0.178 ms:

mplageman=# explain (buffers, analyze, timing off) select * from bar where a > 100 offset 10000000000000;
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1462959.87..1462959.87 rows=1 width=68) (actual rows=0 loops=1)
   Buffers: shared hit=1 read=280571
   I/O Timings: read=1339.795 wait=0.178
   ->  Bitmap Heap Scan on bar  (cost=240521.25..1462959.87 rows=19270298 width=68) (actual rows=19497800 loops=1)
         Recheck Cond: (a > 100)
         Rows Removed by Index Recheck: 400281
         Heap Blocks: exact=47915 lossy=197741
         Buffers: shared hit=1 read=280571
         I/O Timings: read=1339.795 wait=0.178
         ->  Bitmap Index Scan on bar_idx  (cost=0.00..235703.67 rows=19270298 width=0) (actual rows=19497800 loops=1)
               Index Cond: (a > 100)
               Buffers: shared hit=1 read=34915
               I/O Timings: read=1339.795
 Planning:
   Buffers: shared hit=96 read=30
   I/O Timings: read=3.488
 Planning Time: 4.279 ms
 Execution Time: 3434.522 ms
(18 rows)

- Melanie

[1] https://github.com/anarazel/postgres/tree/aio