Re: Asynchronous and "direct" IO support for PostgreSQL. - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Asynchronous and "direct" IO support for PostgreSQL.
Date
Msg-id 20210223210545.67wihyu47z3nyii5@alap3.anarazel.de
Whole thread Raw
In response to Re: Asynchronous and "direct" IO support for PostgreSQL.  (Greg Stark <stark@mit.edu>)
List pgsql-hackers
Hi,

On 2021-02-23 14:58:32 -0500, Greg Stark wrote:
> So firstly this is all just awesome work and I have questions but I
> don't want them to come across in any way as criticism or as a demand
> for more work.

I posted it to get argued with ;).


> The callbacks make me curious about two questions:
>
> 1) Is there a chance that a backend issues i/o, the i/o completes in
> some other backend and by the time this backend gets around to looking
> at the buffer it's already been overwritten again? Do we have to
> initiate I/O again or have you found a way to arrange that this
> backend has the buffer pinned from the time the i/o starts even though
> it doesn't handle the comletion?

The initiator of the IO can just keep a pin for the buffer to prevent
that.

There's a lot of complexity around how to handle pinning and locking
around asynchronous buffer IO. I plan to send a separate email with a
more details.

In short: I made it so that for shared buffer IO holds a separate
refcount for the duration of the IO - that way the issuer can release
its own pin without causing a problem (consider e.g. an error while an
IO is in flight). The pin held by the IO gets released in the completion
callback.  There's similar trickery with locks - but more about that later.


> 2) Have you made (or considered making) things like sequential scans
> (or more likely bitmap index scans) asynchronous at a higher level.
> That is, issue a bunch of asynchronous i/o and then handle the pages
> and return the tuples as the pages arrive. Since sequential scans and
> bitmap scans don't guarantee to read the pages in order they're
> generally free to return tuples from any page in any order. I'm not
> sure how much of a win that would actually be since all the same i/o
> would be getting executed and the savings in shared buffers would be
> small but if there are mostly hot pages you could imagine interleaving
> a lot of in-memory pages with the few i/os instead of sitting idle
> waiting for the async i/o to return.

I have not. Mostly because it'd basically break the entire regression
test suite. And probably a lot of user expectations (yes,
synchronize_seqscan exists, but it pretty rarely triggers).

I'm not sure how big the win would be - the readahead for heapam.c that
is in the patch set tries to keep ahead of the "current scan position"
by a certain amount - so it'll issue the reads for the "missing" pages
before they're needed. Hopefully early enough to avoid unnecessary
stalls. But I'm pretty sure there'll be cases where that'd require a
prohibitively long "readahead distance".

I think there's a lot of interesting things along these lines that we
could tackle, but since they involve changing results and/or larger
changes to avoid those (e.g. detecting when sequential scan order isn't
visible to the user) I think it'd make sense to separate them from the
aio patchset itself.

If we had the infrastructure to detect whether seqscan order matters, we
could also switch to the tuple-iteration order to "backwards" in
heapgetpage() - right now iterating forward in "itemid order" causes a
lot of cache misses because the hardware prefetcher doesn't predict that
the tuples are layed out "decreasing pointer order".


> > ## Stats
> >
> > There are two new views: pg_stat_aios showing AIOs that are currently
> > in-progress, pg_stat_aio_backends showing per-backend statistics about AIO.
>
> This is impressive. How easy is it to correlate with system aio stats?

Could you say a bit more about what you are trying to correlate?

Here's some example IOs from pg_stat_aios.

┌─[ RECORD 1 ]─┬───────────────────────────────────────────────────────────────────────────────────────────────────┐
│ backend_type │ client backend                                                                                    │
│ id           │ 98                                                                                                │
│ gen          │ 13736                                                                                             │
│ op           │ write                                                                                             │
│ scb          │ sb                                                                                                │
│ flags        │ PGAIOIP_IN_PROGRESS | PGAIOIP_INFLIGHT                                                            │
│ ring         │ 5                                                                                                 │
│ owner_pid    │ 1866051                                                                                           │
│ merge_with   │ (null)                                                                                            │
│ result       │ 0                                                                                                 │
│ desc         │ fd: 38, offset: 329588736, nbytes: 8192, already_done: 0, release_lock: 1, buffid: 238576         │

├─[ RECORD 24 ]┼───────────────────────────────────────────────────────────────────────────────────────────────┤
│ backend_type │ checkpointer                                                                                  │
│ id           │ 1501                                                                                          │
│ gen          │ 15029                                                                                         │
│ op           │ write                                                                                         │
│ scb          │ sb                                                                                            │
│ flags        │ PGAIOIP_IN_PROGRESS | PGAIOIP_INFLIGHT                                                        │
│ ring         │ 3                                                                                             │
│ owner_pid    │ 1865275                                                                                       │
│ merge_with   │ 1288                                                                                          │
│ result       │ 0                                                                                             │
│ desc         │ fd: 24, offset: 105136128, nbytes: 8192, already_done: 0, release_lock: 1, buffid: 202705     │

├─[ RECORD 31 ]┼───────────────────────────────────────────────────────────────────────────────────────────────────┤
│ backend_type │ walwriter                                                                                         │
│ id           │ 90                                                                                                │
│ gen          │ 26498                                                                                             │
│ op           │ write                                                                                             │
│ scb          │ wal                                                                                               │
│ flags        │ PGAIOIP_IN_PROGRESS | PGAIOIP_INFLIGHT                                                            │
│ ring         │ 5                                                                                                 │
│ owner_pid    │ 1865281                                                                                           │
│ merge_with   │ 181                                                                                               │
│ result       │ 0                                                                                                 │
│ desc         │ write_no: 17, fd: 12, offset: 6307840, nbytes: 1048576, already_done: 0, bufdata: 0x7f94dd670000  │
└──────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────┘


And the per-backend AIO view shows information like this (too wide to
display all cols):

SELECT
    pid, last_context, backend_type,
    executed_total_count,issued_total_count,submissions_total_count, inflight_count
FROM pg_stat_aio_backends sab JOIN pg_stat_activity USING (pid);

┌─────────┬──────────────┬──────────────────────────────┬──────────────────────┬────────────────────┬─────────────────────────┬────────────────┐
│   pid   │ last_context │         backend_type         │ executed_total_count │ issued_total_count │
submissions_total_count│ inflight_count │
 

├─────────┼──────────────┼──────────────────────────────┼──────────────────────┼────────────────────┼─────────────────────────┼────────────────┤
│ 1865291 │            3 │ logical replication launcher │                    0 │                  0 │
   0 │              0 │
 
│ 1865296 │            0 │ client backend               │                   85 │                 85 │
  85 │              0 │
 
│ 1865341 │            7 │ client backend               │              9574416 │            2905321 │
345642│              0 │
 
...
│ 1873501 │            3 │ client backend               │                 3565 │               3565 │
 467 │             10 │
 
...
│ 1865277 │            0 │ background writer            │               695402 │             575906 │
13513│              0 │
 
│ 1865275 │            3 │ checkpointer                 │              4664110 │            3488530 │
1399896│              0 │
 
│ 1865281 │            3 │ walwriter                    │                77203 │               7759 │
7747 │              3 │
 

└─────────┴──────────────┴──────────────────────────────┴──────────────────────┴────────────────────┴─────────────────────────┴────────────────┘

It's not super obvious at this point but executed_total_count /
issued_total_count shows the rate at which IOs have been
merged. issued_total_count / submissions_total_count shows how many
(already merged) IOs were submitted together in one "submission batch"
(for io_method=io_uring, io_uring_enter()).


So pg_stat_aios should allow to enrich some system stats - e.g. by being
able to split out WAL writes from data file writes. And - except that
the current callbacks for that aren't great - it should even allow to
split the IO by different relfilenodes etc.


I assume pg_stat_aio_backends also can be helpful, e.g. by seeing which
backends currently the deep IO queues that cause latency delays in other
backends, and which backends do a lot of sequential IO (high
executed_total_count / issued_total_count) and which only random...

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Greg Stark
Date:
Subject: Re: Asynchronous and "direct" IO support for PostgreSQL.
Next
From: Thomas Munro
Date:
Subject: Re: Asynchronous and "direct" IO support for PostgreSQL.