Thread: Re: AIO v2.2

Re: AIO v2.2

From

Noah Misch

Date:

06 January, 21:52:20

Patches 1 and 2 are still Ready for Committer.

On Tue, Dec 31, 2024 at 11:03:33PM -0500, Andres Freund wrote:
> - The README has been extended with an overview of the API. I think it gives a
>   good overview of how the API fits together.  I'd be very good to get
>   feedback from folks that aren't as familiar with AIO, I can't really see
>   what's easy/hard anymore.

That's a helpful addition.  I've left inline comments on it, below.

> The biggest TODOs are:
> 
> - Right now the API between bufmgr.c and read_stream.c kind of necessitates
>   that one StartReadBuffers() call actually can trigger multiple IOs, if
>   one of the buffers was read in by another backend, before "this" backend
>   called StartBufferIO().
> 
>   I think Thomas and I figured out a way to evolve the interface so that this
>   isn't necessary anymore:
> 
>   We allow StartReadBuffers() to memorize buffers it pinned but didn't
>   initiate IO on in the buffers[] argument. The next call to StartReadBuffers
>   then doesn't have to repin thse buffers. That doesn't just solve the
>   multiple-IOs for one "read operation" issue, it also make the - very common
>   - case of a bunch of "buffer misses" followed by a "buffer hit" cleaner, the
>   hit wouldn't be tracked in the same ReadBuffersOperation anymore.

That sounds reasonable.

> - Right now bufmgr.h includes aio.h, because it needs to include a reference
>   to the AIO's result in ReadBuffersOperation. Requiring a dynamic allocation
>   would be noticeable overhead, so that's not an option. I think the best
>   option here would be to introduce something like aio_types.h, so fewer
>   things are included.

That sounds fine.  Header splits aren't going to be perfect, so I'd pick
something (e.g. your proposal here) and move on.

> - There's no obvious way to tell "internal" function operating on an IO handle
>   apart from functions that are expected to be called by the issuer of an IO.
> 
>   One way to deal with this would be to introduce a distinct "issuer IO
>   reference" type.  I think that might be a good idea, it would also make it
>   clearer that a good number of the functions can only be called by the
>   issuer, before the IO is submitted.

That's reasonable, albeit non-critical.

> - The naming around PgAioReturn, PgAioResult, PgAioResultStatus needs to be
>   improved

POSIX uses the word "result" for the consequences of a function (e.g. the
result of unlink() is readdir() no longer finding the link).  It uses the word
"return" for a memory value that describes a result.  In that usage, the
struct currently called PgAioResult would be a Return.  The struct currently
called PgAioReturn is PgAioResult plus the data to identify the IO.  Possible
name changes:

  PgAioResult -> PgAioReturn
  PgAioReturn -> PgAioReturnIdentified | PgAioReturnID | PgAioReturnTagged [I don't love these]
  PgAioResultStatus -> PgAioStatus | PgAioFill

That said, I don't dislike the existing names and would not have raised the
topic myself.

> - The debug logging functions are a bit of a mess, lots of very similar code
>   in lots of places. I think AIO needs a few ereport() wrappers to make this
>   easier.

May as well.

> - More tests are needed. None of our current test frameworks really makes this
>   easy :(.

Which testing gap do you find most concerning?  I'd be most interested in the
cases that would be undetected deadlocks under a naive design.  An example
appeared at the end of postgr.es/m/20240916144349.74.nmisch@google.com

> - Several folks asked for pg_stat_aio to come back, in "v1" that showed the
>   set of currently in-flight AIOs.  That's not particularly hard - except
>   that it doesn't really fit in the pg_stat_* namespace.

Later message
postgr.es/m/6vjl6jeaqvyhfbpgwziypwmhem2rwla4o5pgpuxwtg3o3o3jb5@evyzorb5meth is
considering the name pg_aios.  Works for me.

> --- a/src/backend/storage/aio/aio.c
> +++ b/src/backend/storage/aio/aio.c
> @@ -3,6 +3,28 @@
>   * aio.c
>   *    AIO - Core Logic
>   *
> + * For documentation about how AIO works on a higher level, including a
> + * schematic example, see README.md.
> + *
> + *
> + * AIO is a complicated subsystem. To keep things navigable it is split across
> + * a number of files:
> + *
> + * - aio.c - core AIO state handling
> + *
> + * - aio_init.c - initialization
> + *
> + * - aio_io.c - dealing with actual IO, including executing IOs synchronously
> + *
> + * - aio_subject.c - functionality related to executing IO for different
> + *   subjects
> + *
> + * - method_*.c - different ways of executing AIO
> + *
> + * - read_stream.c - helper for accessing buffered relation data with
> + *     look-ahead
> + *

I felt like some list entries in this new header comment largely restated the
file name.  Here's how I'd write them to avoid that:

 * - method_*.c - different ways of executing AIO (e.g. worker process)
 * - aio_io.c - method-independent code for specific IO ops (e.g. readv)
 * - aio_subject.c - callbacks at IO operation lifecycle events
 * - aio_init.c - per-fork and per-startup-process initialization
 * - aio.c - all other topics
 * - read_stream.c - helper for reading buffered relation data

> --- /dev/null
> +++ b/src/backend/storage/aio/README.md
> @@ -0,0 +1,413 @@
> +# Asynchronous & Direct IO

I would move "### Why Asynchronous IO" to here; that's good background before
getting into the example.  I might also move "### Why Direct / unbuffered IO"
to here.  For me as a reader, I'd benefit from seeing things in this order:

- "why"
- condensed usage example like manpage SYNOPSIS, comments and decls removed
- PgAioHandleState and discussion of valid transitions
- usage example as it is, with full comments
- the rest

In other words, like this:

# Asynchronous & Direct IO

## Motivation

### Why Asynchronous IO

[existing content moved from lower in the file]

## Synopsis

ioh = pgaio_io_get(CurrentResourceOwner, &ioret);
pgaio_io_get_ref(ioh, &ior);
pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
pgaio_io_set_io_data_32(ioh, (uint32 *) buffer, 1);
smgrstartreadv(ioh, operation->smgr, forknum, blkno,
               BufferGetBlock(buffer), 1);
pgaio_submit_staged();
pgaio_io_ref_wait(&ior);
if (ioret.result.status == ARS_ERROR)
    pgaio_result_log(aio_ret.result, &aio_ret.subject_data, ERROR);

## I/O Operation States & Transitions

[PgAioHandleState and its transitions]

## AIO Usage Example

[your content:]

> +
> +## AIO Usage Example
> +
> +In many cases code that can benefit from AIO does not directly have to
> +interact with the AIO interface, but can use AIO via higher-level
> +abstractions. See [Helpers](#helpers).
> +
> +In this example, a buffer will be read into shared buffers.
> +
> +```C
> +/*
> + * Result of the operation, only to be accessed in this backend.
> + */
> +PgAioReturn ioret;
> +
> +/*
> + * Acquire AIO Handle, ioret will get result upon completion.

Consider adding: from here to pgaio_submit_staged(), don't do [description of
the kind of unacceptable blocking operations].

> + * Once the IO handle has been handed of, it may not further be used, as the

s/of/off/

> +### IO can be started in critical sections
...
> +The need to be able to execute IO in critical sections has substantial design
> +implication on the AIO subsystem. Mainly because completing IOs (see prior
> +section) needs to be possible within a critical section, even if the
> +to-be-completed IO itself was not issued in a critical section. Consider
> +e.g. the case of a backend first starting a number of writes from shared
> +buffers and then starting to flush the WAL. Because only a limited amount of
> +IO can be in-progress at the same time, initiating the IO for flushing the WAL
> +may require to first finish executing IO executed earlier.

The last line's two appearances of the word "execute" read awkwardly to me,
and it's an opportunity to use PgAioHandleState terms.  Consider writing the
last line like "may first advance an existing IO from AHS_PREPARED to
AHS_COMPLETED_SHARED".

> +ASLR. This means that the shared memory cannot contain pointer to callbacks.

s/pointer/pointers/

> +### AIO Callbacks
...
> +In addition to completion, AIO callbacks also are called to "prepare" an
> +IO. This is, e.g., used to acquire buffer pins owned by the AIO subsystem for
> +IO to/from shared buffers, which is required to handle the case where the
> +issuing backend errors out and releases its own pins.

Reading this, it's not obvious to me how to reconcile "finishing an IO could
require pin acquisition" with "finishing an IO could happen in a critical
section".  Pinning a buffer in a critical section sounds bad.  I vaguely
recall understanding how it was okay as of my September review, but I've
already forgotten.  Can this text have a sentence making that explicit?

> +### AIO Subjects
> +
> +In addition to the completion callbacks describe above, each AIO Handle has
> +exactly one "subject". Each subject has some space inside an AIO Handle with
> +information specific to the subject and can provide callbacks to allow to
> +reopen the underlying file (required for worker mode) and to describe the IO
> +operation (used for debug logging and error messages).

Can this say roughly how to decide when to add a new subject?  Failing that,
can it give examples of what additional subjects might exist if certain
existing subsystems were to start using AIO?

> +### AIO Results
> +
> +As AIO completion callbacks
> +[are executed in critical sections](#io-can-be-started-in-critical-sections)
> +and [may be executed by any backend](#deadlock-and-starvation-dangers-due-to-aio)
> +completion callbacks cannot be used to, e.g., make the query that triggered an
> +IO ERROR out.
> +
> +To allow to react to failing IOs the issuing backend can pass a pointer to a
> +`PgAioReturn` in backend local memory. Before an AIO Handle is reused the
> +`PgAioReturn` is filled with information about the IO. This includes
> +information about whether the IO was successful (as a value of
> +`PgAioResultStatus`) and enough information to raise an error in case of a
> +failure (via `pgaio_result_log()`, with the error details encoded in
> +`PgAioResult`).

Can this have a sentence on how this fits in bounded shmem, given the lack of
guarantees about a backend's responsiveness?  In other words, what makes it
okay to have requests take arbitrarily long to move from AHS_COMPLETED_SHARED
to AHS_COMPLETED_LOCAL?

Thanks,
nm

Re: AIO v2.2

From

Andres Freund

Date:

07 January, 00:40:26

Hi,

On 2025-01-06 10:52:20 -0800, Noah Misch wrote:
> Patches 1 and 2 are still Ready for Committer.

I feel somewhat weird about pushing 0002 without a user, but I guess it's
still exercised, so it's probably fine...


> On Tue, Dec 31, 2024 at 11:03:33PM -0500, Andres Freund wrote:
> > - The README has been extended with an overview of the API. I think it gives a
> >   good overview of how the API fits together.  I'd be very good to get
> >   feedback from folks that aren't as familiar with AIO, I can't really see
> >   what's easy/hard anymore.
>
> That's a helpful addition.  I've left inline comments on it, below.

Cool!


> > - More tests are needed. None of our current test frameworks really makes this
> >   easy :(.
>
> Which testing gap do you find most concerning?

Most of it isn't even AIO specific...


- temporary tables are rather poorly tested in general:
  - e.g. trivial to exceed the number of buffers, but our tests don't reach that

- We have pretty no testing for IO errors. We have a bit of coverage due to
  src/bin/pg_amcheck/t/003_check.pl, but that's for errors originating in
  bufmgr.c itself.

- no real testing of StartBufferIO's etc wait paths

- no testing for BM_PIN_COUNT_WAITER


I e.g. just noticed that the error handling for AIO on temp tables was broken
- but our tests never reach that:

  The bug exists due to temp tables not differentiating between "backend" pins
  and a "global pincount" - which means that there's no real way for the AIO
  subsystem to have a reference separate from the backend local pin -
  CheckForLocalBufferLeaks() complains about any leftover pins. It seems to
  works in non-assert mode, but with assertions transaction abort asserts out.


> I'd be most interested in the
> cases that would be undetected deadlocks under a naive design.  An example
> appeared at the end of postgr.es/m/20240916144349.74.nmisch@google.com

That's a good one, yea.

I think I'll try to translate the regression tests I wrote into an isolation
test, I hope that'll make it a bit easier to cover more cases.

And then we'll need more injection points, I'm afraid :(.


> > - Several folks asked for pg_stat_aio to come back, in "v1" that showed the
> >   set of currently in-flight AIOs.  That's not particularly hard - except
> >   that it doesn't really fit in the pg_stat_* namespace.
>
> Later message
> postgr.es/m/6vjl6jeaqvyhfbpgwziypwmhem2rwla4o5pgpuxwtg3o3o3jb5@evyzorb5meth is
> considering the name pg_aios.  Works for me.

Cool.


> > --- a/src/backend/storage/aio/aio.c
> > +++ b/src/backend/storage/aio/aio.c
> > @@ -3,6 +3,28 @@
> >   * aio.c
> >   *    AIO - Core Logic
> >   *
> > + * For documentation about how AIO works on a higher level, including a
> > + * schematic example, see README.md.
> > + *
> > + *
> > + * AIO is a complicated subsystem. To keep things navigable it is split across
> > + * a number of files:
> > + *
> > + * - aio.c - core AIO state handling
> > + *
> > + * - aio_init.c - initialization
> > + *
> > + * - aio_io.c - dealing with actual IO, including executing IOs synchronously
> > + *
> > + * - aio_subject.c - functionality related to executing IO for different
> > + *   subjects
> > + *
> > + * - method_*.c - different ways of executing AIO
> > + *
> > + * - read_stream.c - helper for accessing buffered relation data with
> > + *     look-ahead
> > + *
>
> I felt like some list entries in this new header comment largely restated the
> file name.  Here's how I'd write them to avoid that:

Thanks, adopting.


>  * - method_*.c - different ways of executing AIO (e.g. worker process)
>  * - aio_io.c - method-independent code for specific IO ops (e.g. readv)
>  * - aio_subject.c - callbacks at IO operation lifecycle events
>  * - aio_init.c - per-fork and per-startup-process initialization

I don't particularly like "per-startup-process", because "global
initialization" really is separate (and precedes) from startup processes
startup. Maybe "per-server and per-backend initialization"?

>  * - aio.c - all other topics
>  * - read_stream.c - helper for reading buffered relation data

Did the order you listed the files have a system to it? If so, what is it?


> > --- /dev/null
> > +++ b/src/backend/storage/aio/README.md
> > @@ -0,0 +1,413 @@
> > +# Asynchronous & Direct IO
>
> I would move "### Why Asynchronous IO" to here; that's good background before
> getting into the example.

I moved the example back and forth when writing because different readers
would benefit from a different order and I couldn't quite decide.

So I'm happy to adjust based on your feedback...


> I might also move "### Why Direct / unbuffered IO" to here.  For me as a
> reader, I'd benefit from seeing things in this order:
>
> - "why"
> - condensed usage example like manpage SYNOPSIS, comments and decls removed
> - PgAioHandleState and discussion of valid transitions

Hm - why have PgAioHandleState and its states before the usage example?  Seems
like it'd be harder to understand that way.

> - usage example as it is, with full comments
> - the rest


> ## Synopsis
>
> ioh = pgaio_io_get(CurrentResourceOwner, &ioret);
> pgaio_io_get_ref(ioh, &ior);
> pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
> pgaio_io_set_io_data_32(ioh, (uint32 *) buffer, 1);
> smgrstartreadv(ioh, operation->smgr, forknum, blkno,
>                BufferGetBlock(buffer), 1);
> pgaio_submit_staged();
> pgaio_io_ref_wait(&ior);
> if (ioret.result.status == ARS_ERROR)
>     pgaio_result_log(aio_ret.result, &aio_ret.subject_data, ERROR);

Happy to add this, but I'm not entirely sure if that's really that useful to
have without commentary? The synopsis in manpages is helpful because it
provides the signature of various functions, but this wouldn't...



> > +
> > +## AIO Usage Example
> > +
> > +In many cases code that can benefit from AIO does not directly have to
> > +interact with the AIO interface, but can use AIO via higher-level
> > +abstractions. See [Helpers](#helpers).
> > +
> > +In this example, a buffer will be read into shared buffers.
> > +
> > +```C
> > +/*
> > + * Result of the operation, only to be accessed in this backend.
> > + */
> > +PgAioReturn ioret;
> > +
> > +/*
> > + * Acquire AIO Handle, ioret will get result upon completion.
>
> Consider adding: from here to pgaio_submit_staged(), don't do [description of
> the kind of unacceptable blocking operations].

Hm.  Strictly speaking it's fine to block here, depending on whether
StartBufferIO() was already called.  I'll clarify.


> > +### IO can be started in critical sections
> ...
> > +The need to be able to execute IO in critical sections has substantial design
> > +implication on the AIO subsystem. Mainly because completing IOs (see prior
> > +section) needs to be possible within a critical section, even if the
> > +to-be-completed IO itself was not issued in a critical section. Consider
> > +e.g. the case of a backend first starting a number of writes from shared
> > +buffers and then starting to flush the WAL. Because only a limited amount of
> > +IO can be in-progress at the same time, initiating the IO for flushing the WAL
> > +may require to first finish executing IO executed earlier.
>
> The last line's two appearances of the word "execute" read awkwardly to me,
> and it's an opportunity to use PgAioHandleState terms.  Consider writing the
> last line like "may first advance an existing IO from AHS_PREPARED to
> AHS_COMPLETED_SHARED".

It is indeed awkward.  I don't love referencing the state-constants here
though, somehow that feels like a reference-cycle ;). What about this:

> ... Consider
> e.g. the case of a backend first starting a number of writes from shared
> buffers and then starting to flush the WAL. Because only a limited amount of
> IO can be in-progress at the same time, initiating IO for flushing the WAL may
> require to first complete IO that was started earlier.



> > +### AIO Callbacks
> ...
> > +In addition to completion, AIO callbacks also are called to "prepare" an
> > +IO. This is, e.g., used to acquire buffer pins owned by the AIO subsystem for
> > +IO to/from shared buffers, which is required to handle the case where the
> > +issuing backend errors out and releases its own pins.
>
> Reading this, it's not obvious to me how to reconcile "finishing an IO could
> require pin acquisition" with "finishing an IO could happen in a critical
> section".  Pinning a buffer in a critical section sounds bad.  I vaguely
> recall understanding how it was okay as of my September review, but I've
> already forgotten.  Can this text have a sentence making that explicit?

Ah, yes, that's easy to misunderstand. The answer basically is that we don't
newly pin a buffer, we just increment the reference count by 1. That should
never fail.

How about:
> In addition to completion, AIO callbacks also are called to "prepare" an
> IO. This is, e.g., used to increase buffer reference counts to account for the
> AIO subsystem referencing the buffer, which is required to handle the case
> where the issuing backend errors out and releases its own pins while the IO is
> still ongoing.



> > +### AIO Subjects
> > +
> > +In addition to the completion callbacks describe above, each AIO Handle has
> > +exactly one "subject". Each subject has some space inside an AIO Handle with
> > +information specific to the subject and can provide callbacks to allow to
> > +reopen the underlying file (required for worker mode) and to describe the IO
> > +operation (used for debug logging and error messages).
>
> Can this say roughly how to decide when to add a new subject?

Hm, there obviously is some fuzziness. I was trying to get to some of that by
mentioning that the subject needs to know how to [re-]open a file and describe
the target of the IO in terms that make sense to the user.

E.g. smgr seemed to make sense as a subject as the smgr layer knows how to
open a file by delegating that to the layer below and the layer above just
knows about smgr, not md.c (or other potential smgr implementations).

The reason to keep this separate from the callbacks was that smgr IO going
through shared buffers, bypassing shared buffers and different smgr
implemenentations all could share the same subject implementation, even if
callbacks would differ between these use cases.


How about:

> I.e., if two different uses of AIO can describe the identity of the file being
> operated on the same way, it likely makes sense to use the same
> subject. E.g. different smgr implementations can describe IO with
> RelFileLocator, ForkNumber and BlockNumber and can thus share a subject. In
> contrast, IO for a WAL file would be described with TimeLineID and XLogRecPtr
> and it would not make sense to use the same subject for smgr and WAL.




> Failing that, can it give examples of what additional subjects might exist
> if certain existing subsystems were to start using AIO?

I think the main ones I can think of are:

1) WAL logging

   This was implemented in v1. I'd guess that "real" WAL logging and
   initializing new WAL segments might use a different subject, but that's
   probably a question of taste.

2) "raw" file IO, for things that don't use the smgr abstraction. I could
   e.g. imagine using AIO in COPY to read / write the FROM/TO file or to
   implement CREATE DATABASE ... STRATEGY file_copy with AIO.

   This was used in v1, e.g. to implement the initial data directory sync
   after a crash. We do that on a filesystem level, not going through smgr
   etc.

3) FE/BE network IO



> > +### AIO Results
> > +
> > +As AIO completion callbacks
> > +[are executed in critical sections](#io-can-be-started-in-critical-sections)
> > +and [may be executed by any backend](#deadlock-and-starvation-dangers-due-to-aio)
> > +completion callbacks cannot be used to, e.g., make the query that triggered an
> > +IO ERROR out.
> > +
> > +To allow to react to failing IOs the issuing backend can pass a pointer to a
> > +`PgAioReturn` in backend local memory. Before an AIO Handle is reused the
> > +`PgAioReturn` is filled with information about the IO. This includes
> > +information about whether the IO was successful (as a value of
> > +`PgAioResultStatus`) and enough information to raise an error in case of a
> > +failure (via `pgaio_result_log()`, with the error details encoded in
> > +`PgAioResult`).
>
> Can this have a sentence on how this fits in bounded shmem, given the lack of
> guarantees about a backend's responsiveness?  In other words, what makes it
> okay to have requests take arbitrarily long to move from AHS_COMPLETED_SHARED
> to AHS_COMPLETED_LOCAL?

I agree this should be explained somewhere - but not sure this is the best
place.

The reason it's ok is that each backend has a limited number of AIO handles
and if it runs out of IO handles we'll a) check if any IOs can be reclaimed b)
wait for the oldest IO to finish.


Thanks for the review!

Andres Freund

Re: AIO v2.2

From

Noah Misch

Date:

07 January, 22:10:59

On Mon, Jan 06, 2025 at 04:40:26PM -0500, Andres Freund wrote:
> On 2025-01-06 10:52:20 -0800, Noah Misch wrote:
> > On Tue, Dec 31, 2024 at 11:03:33PM -0500, Andres Freund wrote:

> - We have pretty no testing for IO errors.

Yes, that's remained a gap.  I've wondered how much to address this via
targeted tests of specific sites vs. fuzzing, iterative fault injection, or
some other approach closer to brute force.

> > I'd be most interested in the
> > cases that would be undetected deadlocks under a naive design.  An example
> > appeared at the end of postgr.es/m/20240916144349.74.nmisch@google.com
> 
> That's a good one, yea.
> 
> I think I'll try to translate the regression tests I wrote into an isolation
> test, I hope that'll make it a bit easier to cover more cases.
> 
> And then we'll need more injection points, I'm afraid :(.

Sounds good.

> >  * - method_*.c - different ways of executing AIO (e.g. worker process)
> >  * - aio_io.c - method-independent code for specific IO ops (e.g. readv)
> >  * - aio_subject.c - callbacks at IO operation lifecycle events
> >  * - aio_init.c - per-fork and per-startup-process initialization
> 
> I don't particularly like "per-startup-process", because "global
> initialization" really is separate (and precedes) from startup processes
> startup. Maybe "per-server and per-backend initialization"?

That works for me.  I wrote "per-startup-process" because it can happen more
than once in a postmaster that reaches "all server processes terminated;
reinitializing".  That said, there's little risk of "per-server" giving folks
a materially wrong idea.

> >  * - aio.c - all other topics
> >  * - read_stream.c - helper for reading buffered relation data
> 
> Did the order you listed the files have a system to it? If so, what is it?

The rough idea was to avoid forward references:

 * - method_*.c - different ways of executing AIO (e.g. worker process)
makes sense without other background
 * - aio_io.c - method-independent code for specific IO ops (e.g. readv)
refers to methods, so listed after methods
 * - aio_subject.c - callbacks at IO operation lifecycle events
refers to IO ops, so listed after aio_io.c
 * - aio_init.c - per-fork and per-startup-process initialization
no surprise that this code will exist somewhere, so list it lower to deemphasize it
 * - aio.c - all other topics
default route, hence last
 * - read_stream.c - helper for reading buffered relation data
could just as easily come first, not last
could be under a distinct heading like "Recommended abstractions:" 

> > I'd benefit from seeing things in this order:
> >
> > - "why"
> > - condensed usage example like manpage SYNOPSIS, comments and decls removed
> > - PgAioHandleState and discussion of valid transitions
> 
> Hm - why have PgAioHandleState and its states before the usage example?  Seems
> like it'd be harder to understand that way.

I usually look at the data structures before the code that manipulates them.
(Similarly, I look at the map before the directions.)  I wouldn't mind it
appearing after the usage example, since order preferences do vary.

> > - usage example as it is, with full comments
> > - the rest
> 
> 
> > ## Synopsis
> >
> > ioh = pgaio_io_get(CurrentResourceOwner, &ioret);
> > pgaio_io_get_ref(ioh, &ior);
> > pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
> > pgaio_io_set_io_data_32(ioh, (uint32 *) buffer, 1);
> > smgrstartreadv(ioh, operation->smgr, forknum, blkno,
> >                BufferGetBlock(buffer), 1);
> > pgaio_submit_staged();
> > pgaio_io_ref_wait(&ior);
> > if (ioret.result.status == ARS_ERROR)
> >     pgaio_result_log(aio_ret.result, &aio_ret.subject_data, ERROR);
> 
> Happy to add this, but I'm not entirely sure if that's really that useful to
> have without commentary? The synopsis in manpages is helpful because it
> provides the signature of various functions, but this wouldn't...

I'm not sure either.  Let's drop that idea.

> > > +### IO can be started in critical sections
> > ...
> > > +The need to be able to execute IO in critical sections has substantial design
> > > +implication on the AIO subsystem. Mainly because completing IOs (see prior
> > > +section) needs to be possible within a critical section, even if the
> > > +to-be-completed IO itself was not issued in a critical section. Consider
> > > +e.g. the case of a backend first starting a number of writes from shared
> > > +buffers and then starting to flush the WAL. Because only a limited amount of
> > > +IO can be in-progress at the same time, initiating the IO for flushing the WAL
> > > +may require to first finish executing IO executed earlier.
> >
> > The last line's two appearances of the word "execute" read awkwardly to me,
> > and it's an opportunity to use PgAioHandleState terms.  Consider writing the
> > last line like "may first advance an existing IO from AHS_PREPARED to
> > AHS_COMPLETED_SHARED".
> 
> It is indeed awkward.  I don't love referencing the state-constants here
> though, somehow that feels like a reference-cycle ;). What about this:
> 
> > ... Consider
> > e.g. the case of a backend first starting a number of writes from shared
> > buffers and then starting to flush the WAL. Because only a limited amount of
> > IO can be in-progress at the same time, initiating IO for flushing the WAL may
> > require to first complete IO that was started earlier.

That's non-awkward.  I like specific state names here since "complete" could
mean AHS_COMPLETED_SHARED or AHS_COMPLETED_LOCAL, and it matters here.  If the
state names changed so AHS_COMPLETED_LOCAL dropped the word "complete", that
too would solve it.

> > > +### AIO Callbacks
> > ...
> > > +In addition to completion, AIO callbacks also are called to "prepare" an
> > > +IO. This is, e.g., used to acquire buffer pins owned by the AIO subsystem for
> > > +IO to/from shared buffers, which is required to handle the case where the
> > > +issuing backend errors out and releases its own pins.
> >
> > Reading this, it's not obvious to me how to reconcile "finishing an IO could
> > require pin acquisition" with "finishing an IO could happen in a critical
> > section".  Pinning a buffer in a critical section sounds bad.  I vaguely
> > recall understanding how it was okay as of my September review, but I've
> > already forgotten.  Can this text have a sentence making that explicit?
> 
> Ah, yes, that's easy to misunderstand. The answer basically is that we don't
> newly pin a buffer, we just increment the reference count by 1. That should
> never fail.
> 
> How about:
> > In addition to completion, AIO callbacks also are called to "prepare" an
> > IO. This is, e.g., used to increase buffer reference counts to account for the
> > AIO subsystem referencing the buffer, which is required to handle the case
> > where the issuing backend errors out and releases its own pins while the IO is
> > still ongoing.

Perfect.

> > > +### AIO Subjects
> > > +
> > > +In addition to the completion callbacks describe above, each AIO Handle has
> > > +exactly one "subject". Each subject has some space inside an AIO Handle with
> > > +information specific to the subject and can provide callbacks to allow to
> > > +reopen the underlying file (required for worker mode) and to describe the IO
> > > +operation (used for debug logging and error messages).
> >
> > Can this say roughly how to decide when to add a new subject?
> 
> Hm, there obviously is some fuzziness. I was trying to get to some of that by
> mentioning that the subject needs to know how to [re-]open a file and describe
> the target of the IO in terms that make sense to the user.
> 
> E.g. smgr seemed to make sense as a subject as the smgr layer knows how to
> open a file by delegating that to the layer below and the layer above just
> knows about smgr, not md.c (or other potential smgr implementations).
> 
> The reason to keep this separate from the callbacks was that smgr IO going
> through shared buffers, bypassing shared buffers and different smgr
> implemenentations all could share the same subject implementation, even if
> callbacks would differ between these use cases.
> 
> 
> How about:
> 
> > I.e., if two different uses of AIO can describe the identity of the file being
> > operated on the same way, it likely makes sense to use the same
> > subject. E.g. different smgr implementations can describe IO with
> > RelFileLocator, ForkNumber and BlockNumber and can thus share a subject. In
> > contrast, IO for a WAL file would be described with TimeLineID and XLogRecPtr
> > and it would not make sense to use the same subject for smgr and WAL.

Sounds good to include.

> > Can this have a sentence on how this fits in bounded shmem, given the lack of
> > guarantees about a backend's responsiveness?  In other words, what makes it
> > okay to have requests take arbitrarily long to move from AHS_COMPLETED_SHARED
> > to AHS_COMPLETED_LOCAL?
> 
> I agree this should be explained somewhere - but not sure this is the best
> place.
> 
> The reason it's ok is that each backend has a limited number of AIO handles
> and if it runs out of IO handles we'll a) check if any IOs can be reclaimed b)
> wait for the oldest IO to finish.

Reading it again today, that topic may already have adequate coverage.