From: Melanie Plageman <melanieplageman@gmail.com>
Subject: Re: Asynchronous and "direct" IO support for PostgreSQL.
Msg-id: CAAKRu_bdJQjAddtsWJrrcB8rx60wA6E8y5AcLNwQ9RLsvjqqXQ@mail.gmail.com
In response to: Re: Asynchronous and "direct" IO support for PostgreSQL. (Melanie Plageman <melanieplageman@gmail.com>)
List: pgsql-hackers
On Wed, Jul 28, 2021 at 1:37 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Tue, Feb 23, 2021 at 5:04 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > ## AIO API overview
> >
> > The main steps to use AIO (without higher level helpers) are:
> >
> > 1) acquire an "unused" AIO: pgaio_io_get()
> >
> > 2) start some IO, this is done by functions like
> >    pgaio_io_start_(read|write|fsync|flush_range)_(smgr|sb|raw|wal)
> >
> >    The (read|write|fsync|flush_range) indicates the operation, whereas
> >    (smgr|sb|raw|wal) determines how IO completions, errors, ... are handled.
> >
> >    (see below for more details about this design choice - it might or not be
> >    right)
> >
> > 3) optionally: assign a backend-local completion callback to the IO
> >    (pgaio_io_on_completion_local())
> >
> > 4) 2) alone does *not* cause the IO to be submitted to the kernel, but to be
> >    put on a per-backend list of pending IOs. The pending IOs can be explicitly
> >    flushed with pgaio_submit_pending(), but will also be submitted if the
> >    pending list gets too large, or if the current backend waits for the
> >    IO.
> >
> >    There are two main reasons not to submit the IO immediately:
> >    - If adjacent, we can merge several IOs into one "kernel level" IO during
> >      submission. Larger IOs are considerably more efficient.
> >    - Several AIO APIs allow submitting a batch of IOs in one system call.
> >
> > 5) wait for the IO: pgaio_io_wait() waits for an IO "owned" by the current
> >    backend. When other backends may need to wait for an IO to finish,
> >    pgaio_io_ref() can put a reference to that AIO in shared memory (e.g. a
> >    BufferDesc), which can be waited for using pgaio_io_wait_ref().
> >
> > 6) Process the results of the request. If a callback was registered in 3),
> >    this isn't always necessary. The results of AIO can be accessed using
> >    pgaio_io_result() which returns an integer where negative numbers are
> >    -errno, and positive numbers are the [partial] success conditions
> >    (e.g. potentially indicating a short read).
> >
> > 7) release ownership of the io (pgaio_io_release()) or reuse the IO for
> >    another operation (pgaio_io_recycle())
> >
> >
> > Most places that want to use AIO shouldn't themselves need to care about
> > managing the number of writes in flight, or the readahead distance. To help
> > with that there are two helper utilities, a "streaming read" and a "streaming
> > write".
> >
> > The "streaming read" helper uses a callback to determine which blocks to
> > prefetch - that allows to do readahead in a sequential fashion but importantly
> > also allows to asynchronously "read ahead" non-sequential blocks.
> >
> > E.g. for vacuum, lazy_scan_heap() has a callback that uses the visibility map
> > to figure out which block needs to be read next. Similarly lazy_vacuum_heap()
> > uses the tids in LVDeadTuples to figure out which blocks are going to be
> > needed. Here's the latter as an example:
> >
> > https://github.com/anarazel/postgres/commit/a244baa36bfb252d451a017a273a6da1c09f15a3#diff-3198152613d9a28963266427b380e3d4fbbfabe96a221039c6b1f37bc575b965R1906
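
The callback essentially answers "which block should be read next?". A
hypothetical callback for the LVDeadTuples case could look like the sketch
below; the uintptr_t private-data convention, the LVVacState struct, and its
fields are assumptions for illustration, not the branch's actual code:

/*
 * Hypothetical streaming read callback for lazy_vacuum_heap(): return the
 * next block number containing dead tuples, or InvalidBlockNumber when done.
 */
typedef struct LVVacState
{
    LVDeadTuples *dead_tuples;  /* sorted TIDs collected by the heap scan */
    int           cur;          /* next TID to look at */
    BlockNumber   last_block;   /* last block returned; init to InvalidBlockNumber */
} LVVacState;

static BlockNumber
vacuum_heap_next_block(uintptr_t pgsr_private)
{
    LVVacState   *state = (LVVacState *) pgsr_private;
    LVDeadTuples *dead = state->dead_tuples;

    /* skip remaining TIDs on the block we already returned */
    while (state->cur < dead->num_tuples &&
           ItemPointerGetBlockNumber(&dead->itemptrs[state->cur]) ==
           state->last_block)
        state->cur++;

    if (state->cur >= dead->num_tuples)
        return InvalidBlockNumber;

    /* the TIDs are sorted, so blocks come back in ascending order */
    state->last_block =
        ItemPointerGetBlockNumber(&dead->itemptrs[state->cur]);
    return state->last_block;
}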
> >
>
> Attached is a patch on top of the AIO branch which does bitmapheapscan
> prefetching using the PgStreamingRead helper already used by sequential
> scan and vacuum on the AIO branch.
>
> The prefetch iterator is removed and the main iterator in the
> BitmapHeapScanState node is now used by the PgStreamingRead helper.
>
...
>
> Oh, and I haven't done testing to see how effective the prefetching is
> -- that is a larger project that I have yet to tackle.
>

I have done some testing on how effective it is now.

I've also updated the original patch to count the first page (in the
lossy/exact page counts mentioned down-thread) as well as to remove
unused prefetch fields and comments.
I've also included a second patch which adds IO wait time information to
EXPLAIN output when used like:
  EXPLAIN (buffers, analyze) SELECT ...
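
The wait time is collected with the usual track_io_timing/instr_time pattern
around the AIO wait calls. A minimal sketch of that pattern follows; the
pgBufferUsage field (aio_wait_time) and the buffer-header member are made up
for illustration:

/* accounting sketch, mirroring how blk_read_time is tracked today */
instr_time  io_start, io_time;

if (track_io_timing)
    INSTR_TIME_SET_CURRENT(io_start);

pgaio_io_wait_ref(&bufHdr->io_in_progress);     /* arguments approximated */

if (track_io_timing)
{
    INSTR_TIME_SET_CURRENT(io_time);
    INSTR_TIME_ACCUM_DIFF(pgBufferUsage.aio_wait_time, io_time, io_start);
}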

The same commit also introduces a temporary dev GUC,
io_bitmap_prefetch_depth, which I am using to experiment with the
prefetch window size.
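
For reference, the GUC is a plain integer entry in guc.c's
ConfigureNamesInt array, roughly like this (the description, default, and
limits shown are placeholders, not the patch's actual values):

{
    {"io_bitmap_prefetch_depth", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
        gettext_noop("Prefetch window used by bitmap heap scans."),
        NULL
    },
    &io_bitmap_prefetch_depth,
    128, 1, MAX_IO_CONCURRENCY,
    NULL, NULL, NULL
},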

I wanted to share some results from changing the prefetch window to
demonstrate how prefetching is working.

The short version of my results is that the prefetching works:

- with the prefetch window set to 1, the IO wait time is 1550 ms
- with the prefetch window set to 128, the IO wait time is 0.18 ms

DDL and repro details below:

On Andres' AIO branch [1], with my bitmap heapscan prefetching patch set
applied, built with the following build flags:
  -O2 -fno-omit-frame-pointer --with-liburing

And these non-default PostgreSQL settings:
  io_data_direct=1
  io_data_force_async=off
  io_method=io_uring
  log_min_duration_statement=0
  log_duration=on
  set track_io_timing to on;

  set max_parallel_workers_per_gather to 0;
  set enable_seqscan to off;
  set enable_indexscan to off;
  set enable_bitmapscan to on;

  set effective_io_concurrency to 128;
  set io_bitmap_prefetch_depth to 128;

Using this DDL:

drop table if exists bar;
create table bar(a int, b text, c text, d int);
create index bar_idx on bar(a);
insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz',
i from generate_series(1,1000)i;
insert into bar select i%3, md5(i::text),
'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,1000)i;
insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz',
i from generate_series(1,200)i;
insert into bar select i%100, md5(i::text),
'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,10000000)i;
insert into bar select i%2000, md5(i::text),
'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,10000000)i;
insert into bar select i%10, md5(i::text),
'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,10000000)i;
insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz',
i from generate_series(1,10000000)i;
insert into bar select i%100, md5(i::text),
'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,10000000)i;
insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz',
i from generate_series(1,2000)i;
insert into bar select i%10, md5(i::text),
'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,2000)i;
analyze;

And this query:

select * from bar where a > 100 offset 10000000000000;

with the prefetch window set to 1,
the query execution time is:
5496.129 ms

and IO wait time is:
1550.915 ms

mplageman=# explain (buffers, analyze, timing off) select * from bar where a > 100 offset 10000000000000;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1462959.87..1462959.87 rows=1 width=68) (actual rows=0 loops=1)
   Buffers: shared hit=1 read=280571
   I/O Timings: read=1315.845 wait=1550.915
   ->  Bitmap Heap Scan on bar  (cost=240521.25..1462959.87 rows=19270298 width=68) (actual rows=19497800 loops=1)
         Recheck Cond: (a > 100)
         Rows Removed by Index Recheck: 400281
         Heap Blocks: exact=47915 lossy=197741
         Buffers: shared hit=1 read=280571
         I/O Timings: read=1315.845 wait=1550.915
         ->  Bitmap Index Scan on bar_idx  (cost=0.00..235703.67 rows=19270298 width=0) (actual rows=19497800 loops=1)
               Index Cond: (a > 100)
               Buffers: shared hit=1 read=34915
               I/O Timings: read=1315.845
 Planning:
   Buffers: shared hit=96 read=30
   I/O Timings: read=3.399
 Planning Time: 4.378 ms
 Execution Time: 5473.404 ms
(18 rows)

with the prefetch window set to 128,
the query execution time is:
3222 ms

and IO wait time is:
0.178 ms

mplageman=# explain (buffers, analyze, timing off) select * from bar where a > 100 offset 10000000000000;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1462959.87..1462959.87 rows=1 width=68) (actual rows=0 loops=1)
   Buffers: shared hit=1 read=280571
   I/O Timings: read=1339.795 wait=0.178
   ->  Bitmap Heap Scan on bar  (cost=240521.25..1462959.87 rows=19270298 width=68) (actual rows=19497800 loops=1)
         Recheck Cond: (a > 100)
         Rows Removed by Index Recheck: 400281
         Heap Blocks: exact=47915 lossy=197741
         Buffers: shared hit=1 read=280571
         I/O Timings: read=1339.795 wait=0.178
         ->  Bitmap Index Scan on bar_idx  (cost=0.00..235703.67 rows=19270298 width=0) (actual rows=19497800 loops=1)
               Index Cond: (a > 100)
               Buffers: shared hit=1 read=34915
               I/O Timings: read=1339.795
 Planning:
   Buffers: shared hit=96 read=30
   I/O Timings: read=3.488
 Planning Time: 4.279 ms
 Execution Time: 3434.522 ms
(18 rows)

- Melanie

[1] https://github.com/anarazel/postgres/tree/aio
